## Abstract

Capturing both the fine structure and the high dynamics of a scene is demanded in many applications. However, such high-throughput recording requires significant transmission bandwidth and large storage. Off-the-shelf super-resolution and temporal compressive sensing can partially address this challenge, but directly concatenating the two techniques fails to boost throughput because artifacts accumulate and are magnified in sequential reconstruction. In this Letter, we propose an encoded capturing approach that simultaneously increases spatial and temporal resolvability with a low-bandwidth camera sensor. Specifically, we introduce point-spread-function (PSF) engineering via deep optics to encode fine spatial details and temporal compressive sensing to encode fast motions into a low-resolution snapshot. Furthermore, we develop an end-to-end deep neural network to optimize the PSF and retrieve high-throughput videos from a compactly compressed measurement. Trained on simulation data and fine-tuned to fit the system settings, our end-to-end system offers $128 \times$ the data throughput of conventional imaging.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

High-speed high-resolution imaging has become highly desirable in many applications, from microscopic imaging, such as observation of physical/chemical phenomena and biological fluorescence imaging, to macroscopic imaging, such as TV broadcasting, surveillance, and autonomous driving. Directly recording such a large amount of visual data imposes extreme pressure on imaging systems. The acquisition, storage, and processing of these data may incur significant time costs, power consumption, space/memory footprint, and possibly human resource costs. Moreover, designing and building such equipment is expensive, which hinders wide deployment. Fortunately, computational imaging [1,2] shifts part of the burden from imaging hardware to post-processing algorithms by designing optical modulation setups and jointly developing the corresponding reconstruction algorithms, which has proven effective in high-speed imaging [3–6], super-resolution (SR) imaging, image deblurring, etc.

As a representative optical modulation component, diffractive optical elements (DOEs) have the advantages of a compact form factor, a large and flexible design space, and relatively good off-axis imaging behavior, which have enabled many simple and lightweight camera designs. The complexity of the optics can be reduced by introducing DOEs, which have been successfully applied to SR [7,8], spectral [9,10], depth [11], and high-dynamic-range (HDR) imaging [12], etc. Combining DOEs with point-spread-function (PSF) engineering, researchers have developed an efficient way to optimize the optical design together with sensor performance and a reconstruction algorithm in an end-to-end (E2E) fashion [8].

To deal with the dilemma of limited hardware bandwidth for high-throughput data acquisition, compressive sensing (CS) [13,14] is an effective solution, and with the development of deep learning [15], back-end reconstruction algorithms for decoding high-dimensional imaging data from compressed measurements have flourished over the years. CS allows image data to be sampled below the Nyquist rate, and by exploiting CS algorithms for reconstruction, many optical setups have been proposed, such as temporal compressive imaging [5,16], spectral compressive imaging [17], and compressed ultrafast photography [18]. Deep learning serves as an E2E solution to inverse problems in imaging and has already been successfully applied to CS reconstruction [19]. Therefore, to improve the imaging throughput of optical systems, both the optical setup and its corresponding reconstruction techniques should be considered simultaneously. Inspired by CS, video snapshot compressive imaging (SCI) aims to decode high-speed frames from a single-shot encoded measurement. The underlying principle is to use a group of coding patterns to modulate video frames at a speed higher than the camera capture rate [20].
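The SCI principle stated in the last sentence can be made concrete with a toy sketch (hypothetical dimensions): the coding masks switch faster than the camera readout, so one coded snapshot stands in for $B$ full frames.

```python
import numpy as np

rng = np.random.default_rng(1)
B, nx, ny = 8, 16, 16                                   # toy dimensions

frames = rng.random((B, nx, ny))                        # fast-changing scene
masks = rng.integers(0, 2, (B, nx, ny)).astype(float)   # per-frame binary codes

# The masks change faster than the camera readout, so the sensor integrates
# all B modulated frames into a single coded snapshot:
snapshot = (masks * frames).sum(axis=0)

# Bandwidth saving: one nx x ny readout encodes B full frames.
print(frames.size // snapshot.size)  # 8
```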

This Letter proposes a new encoding scheme that takes advantage of DOE and SCI to enable high-speed high-resolution imaging. As shown in Fig. 1, our system captures the target scene with a commercial primary lens (KOWA LM50HC, $f = 50\,\,\rm mm$) and incorporates two modulators in the light path: a DOE phase plate placed at the Fourier plane and fast-changing random binary masks produced by a liquid crystal on silicon (LCoS) device at the image plane. The DOE modulates the phase of the incident light and produces a customized PSF at the image plane (shown in the top left of Fig. 1); its phase pattern, a $2048 \times 2048$ height profile, is calculated by phase retrieval. The phase plate is fabricated with a 3.2 µm pixel size, and the optimized wavelength is 550 nm. The PSF-convolved video sequence is then relayed and spatiotemporally encoded with an LCoS (ForthDD, QXGA-3DM, $2048 \times 1536$ pixels, 4.5 kHz refresh rate) to generate the final encoded measurement. We display binary random patterns on the LCoS and use a CMOS sensor (JAI, GO-5000M-USB, $2560 \times 2048$ pixels) to capture the scene. Together with the binning down-sampling process at the sensor, this produces a low-resolution image with a special encoding pattern, used later for (spatial) SR. The synchronization between the camera and the LCoS is achieved by a signal generator. We use two high-quality large-field relay lenses (Chiopt, LS1610A) to pass the image plane to the following optics. Other optical components include a polarization beam splitter (Thorlabs, CCM1-PBS251/M) and polarizers (Daheng Optics, GCL-050003). To the best of our knowledge, these two modulators are adopted simultaneously for the first time, responsible for the spatial and temporal encoding of dynamic scenes, respectively. Furthermore, we develop a novel E2E deep network to jointly optimize the optical design and the reconstruction module.
The network treats the PSF as a convolutional layer with trainable parameters and predicts an SR frame sequence from a single low-resolution snapshot measurement. The output data throughput is 128 times the input data throughput, with $4 \times 4$ times higher spatial (pixel) resolution and eight times higher temporal resolution.
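The quoted throughput gain follows directly from the two encoding factors; a one-line sanity check:

```python
# Throughput gain of the encoded capture relative to the raw sensor stream.
# The sensor records one low-resolution snapshot per exposure; the network
# decodes B high-speed frames, each upscaled by r in both spatial dimensions.
r = 4   # spatial upscale factor (4 x 4 more pixels per frame)
B = 8   # number of high-speed frames decoded from one snapshot

throughput_gain = r * r * B
print(throughput_gain)  # 128 output pixels per captured pixel
```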

As shown in Fig. 2, the phase plate is placed at the Fourier plane (front focal plane) of the primary lens and acts as the pupil of the whole system. To model the light propagation, we apply scalar diffraction theory [21] under the paraxial approximation. The signal before coding mask modulation, $I(x,y)$, is expressed as

$$I(x,y) = {\cal S}\left[({p_\lambda} * X)(x,y)\right], \tag{1}$$

where ${\cal S}$ is the down-sampling operator corresponding to the physical structure of the CCD or CMOS sensor, $X$ is the latent high-resolution image, ${p_\lambda}$ is the kernel (or PSF) realized by the optical system, and $*$ denotes convolution. In our case, we assume ${\cal S}$ is the binning sampling operator, which sums the photons arriving at all pixels within each $r \times r$ area, where $r$ is the spatial upscale factor ($r = 4$ in our system). The derivation of the diffractive model and the PSF analysis can be found in Sections 1 and 5 of Supplement 1.

In the SCI part, implemented by the LCoS, $B$ high-speed frames $\{I_{b}\}_{b=1}^{B}\in {\mathbb R}^{n_{x}\times n_{y}}$ are modulated by the corresponding masks $\{C_{b}\}_{b=1}^{B}\in {\mathbb R}^{n_{x}\times n_{y}}$. The measurement $Y \in {\mathbb R}^{n_{x}\times n_{y}}$ is given by

$$Y = \sum_{b = 1}^{B} C_b \odot I_b + G, \tag{2}$$

where $\odot$ denotes the Hadamard (element-wise) product and $G$ represents noise. We define $s = [s_1^\top, \ldots, s_B^\top]^\top$ with ${s_b} = {\rm vec}({I_b})$, and let ${D_b} = {\rm diag}({\rm vec}({C_b}))$ for $b = 1, \ldots, B$, where ${\rm vec}(\cdot)$ vectorizes a matrix by stacking its columns and ${\rm diag}(\cdot)$ places the ensuing vector on the diagonal of a diagonal matrix. This gives us the vector formulation of the sensing process of video SCI,

$$y = \Phi s + g,$$

where $y = {\rm vec}(Y)$, $\Phi = [D_1, \ldots, D_B] \in {\mathbb R}^{n \times nB}$ is the sensing matrix with $n = n_x n_y$ and diagonal blocks $\{D_b\}_{b=1}^{B}$, $s \in {\mathbb R}^{nB}$ is the desired signal, and $g \in {\mathbb R}^{n}$ denotes the vectorized noise. Substituting (1) into (2) yields the final encoded measurement

$$Y = \sum_{b = 1}^{B} C_b \odot {\cal S}({p_\lambda} * X_b) + G. \tag{3}$$
Afterwards, the measurement $Y$ is fed into a deep network to reconstruct ${X_b}$. Note that, unlike existing deep-learning-based networks for SCI reconstruction, we embed ${p_\lambda}$ into our reconstruction network, thus achieving SR high-speed reconstruction from the low-resolution measurements with an E2E network. In particular, our reconstruction network stems from BIRNAT [22], which uses a bidirectional recurrent neural network as the backbone to exploit the temporal correlation between adjacent frames. Details are shown in Fig. 2. We further integrate an SR module and the PSF module into the base network to construct an E2E model that retains the recurrent characteristic while outputting high-quality high-resolution video frames. Details of the proposed E2E model and simulation results are described in Sections 2 and 3 of Supplement 1.
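The forward model described above can be simulated end to end, and the matrix and vector formulations of the snapshot measurement can be cross-checked numerically. A minimal NumPy sketch with illustrative assumptions (a toy Gaussian stand-in for the engineered PSF, random binary masks, toy dimensions):

```python
import numpy as np

def psf_blur(X, psf):
    """Circular convolution with the PSF via FFT (a toy stand-in for the
    optical blur; the real system has finite apertures and boundaries)."""
    return np.real(np.fft.ifft2(np.fft.fft2(X) * np.fft.fft2(np.fft.ifftshift(psf))))

def bin_down(img, r):
    """Binning operator S: sum the photons in each non-overlapping r x r block."""
    h, w = img.shape
    return img.reshape(h // r, r, w // r, r).sum(axis=(1, 3))

rng = np.random.default_rng(0)
B, r, H = 8, 4, 64                        # frames per snapshot, upscale factor, HR size
X = rng.random((B, H, H))                 # latent high-resolution frames X_b
yy, xx = np.mgrid[-H // 2:H // 2, -H // 2:H // 2]
psf = np.exp(-(xx**2 + yy**2) / 8.0)      # toy Gaussian PSF (not the optimized one)
psf /= psf.sum()                          # unit-sum kernel conserves photons

# Spatial encoding, Eq. (1): I_b = S(p * X_b), one low-resolution frame each.
I = np.stack([bin_down(psf_blur(X[b], psf), r) for b in range(B)])

# Temporal encoding, Eq. (2): Y = sum_b C_b ⊙ I_b (noise omitted).
C = rng.integers(0, 2, size=I.shape).astype(float)
Y = (C * I).sum(axis=0)
print(Y.shape)  # (16, 16): one low-resolution snapshot encodes B HR frames

# Vector form y = Phi s, with Phi = [D_1, ..., D_B] and D_b = diag(vec(C_b)).
vec = lambda M: M.flatten(order="F")      # column-stacking vec()
Phi = np.hstack([np.diag(vec(C[b])) for b in range(B)])
s = np.concatenate([vec(I[b]) for b in range(B)])
print(np.allclose(Phi @ s, vec(Y)))  # True: both formulations agree
```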

We build a prototype of our high-speed SR imaging framework and conduct systematic calibration of the binary modulation masks and the engineered PSF before capturing real scenes. We train a base model on simulated data for three weeks on a single NVIDIA RTX 8000 GPU and fine-tune it on real captured data with the calibrated system parameters for 2000 iterations (about half an hour). The inference time is on the order of seconds. Details of system calibration and fine-tuning can be found in Section 6 of Supplement 1. We use our prototype to capture several challenging high-speed scenes. Figure 3 and Visualization 1, Visualization 2, Visualization 3, Visualization 4, Visualization 5, and Visualization 6 display several reconstructed sequences, demonstrating that our prototype can capture high-resolution high-speed videos with low bandwidth. The full measurement is captured with an exposure time of 42 ms and a spatial resolution of $1280 \times 1280$ pixels, leading to a final $5120 \times 5120$ high-resolution frame sequence at 190 fps. Due to GPU memory limits, the network can only accept $256 \times 256$ patches and output $1024 \times 1024$ images, so we divide the whole snapshot measurement into $5 \times 5$ patches and conduct patch-wise reconstruction, reserving overlapping areas between adjacent patches to reduce blocky artifacts. From the reconstruction of the “rotating windmill” clip in Fig. 3, we can see that our setup captures both the high-speed motion of the blade tips and the fine structures printed on the windmill. From the zoomed-in regions, we notice that the patterns on the rotating windmill are finely reconstructed and their motions can be clearly observed. Due to the glossy surface of the windmill, specular highlights occur in some regions and saturated areas appear in the snapshot measurement, as can be seen in Fig. 3. Nevertheless, our approach still reconstructs these dynamically varying high-reflection areas plausibly.
For more analysis of noise and dynamic range issues, please refer to Section 6 of Supplement 1. Similarly, the proposed approach also shows promising results on the other examples (see Visualization 2, Visualization 3, Visualization 4, Visualization 5, and Visualization 6).
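The overlapped patch-wise reconstruction described above can be sketched as follows; a minimal NumPy illustration in which the per-patch operator is a placeholder identity standing in for the actual network (patch and overlap sizes are illustrative, not the system's):

```python
import numpy as np

def patchwise_apply(img, f, patch, overlap):
    """Apply a per-patch operator f over overlapping tiles and blend the
    results by averaging in the overlap regions (reduces blocky artifacts)."""
    H, W = img.shape
    step = patch - overlap
    out = np.zeros_like(img, dtype=float)
    weight = np.zeros_like(img, dtype=float)
    for top in range(0, H - overlap, step):
        for left in range(0, W - overlap, step):
            bot = min(top + patch, H)
            right = min(left + patch, W)
            t, l = bot - patch, right - patch   # shift the last tile inward to fit
            out[t:bot, l:right] += f(img[t:bot, l:right])
            weight[t:bot, l:right] += 1.0
    return out / weight                         # average where tiles overlap

# Toy check: with an identity "reconstructor" the blended mosaic must
# reproduce the input exactly, regardless of the tiling.
x = np.random.default_rng(2).random((128, 128))
y = patchwise_apply(x, lambda p: p, patch=32, overlap=8)
print(np.allclose(y, x))  # True
```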

We test the spatial resolution of our setup by capturing a moving resolution chart with and without the phase plate mounted, applying a low-resolution reconstruction model that ignores the PSF and a high-resolution model that accounts for it. The results are shown in Fig. 4. After reconstruction with our proposed high-resolution E2E model, fine details can be resolved with a low-resolution camera sensor. With the low-resolution model, by contrast, the third peak along the green line cannot be reconstructed, and all three peaks along the blue line merge into one and cannot be resolved. Even where the peaks are resolved, the intensity profile along the red line is less smooth than that of the E2E model.

In summary, we have proposed a coded video capture system that achieves high-throughput video recording at a low bandwidth. These benefits stem from engineered spatiotemporal encoding and an E2E deep reconstruction framework. The highly multiplexed scheme (temporal accumulation and spatial binning) also alleviates the noise issue in high-speed high-resolution imaging. The proposed approach is promising for developing low-budget cameras for applications that require both high pixel resolution and high frame rates. In the future, more effort will be needed on miniaturization and efficient reconstruction. We envision that with such cameras, our imaging system will be able to capture high-resolution high-speed videos in low-light scenarios such as nighttime photography, surveillance, and autonomous driving.

## Funding

National Natural Science Foundation of China (61931012, 62088102, 62171258); Ministry of Science and Technology of the People’s Republic of China (2020AA0108202).

## Disclosures

The authors declare no conflicts of interest.

## Data availability

Data underlying the results presented in this Letter may be obtained from the authors upon reasonable request.

## Supplemental document

See Supplement 1 for supporting content.

## REFERENCES

**1. **Y. Altmann, S. McLaughlin, M. J. Padgett, V. K. Goyal, A. O. Hero, and D. Faccio, Science **361**, eaat2298 (2018). [CrossRef]

**2. **J. N. Mait, G. W. Euliss, and R. A. Athale, Adv. Opt. Photon. **10**, 409 (2018). [CrossRef]

**3. **D. Reddy, A. Veeraraghavan, and R. Chellappa, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)* (IEEE, 2011).

**4. **D. Liu, J. Gu, Y. Hitomi, M. Gupta, T. Mitsunaga, and S. K. Nayar, IEEE Trans. Pattern Anal. Mach. Intell. **36**, 1258 (2014). [CrossRef]

**5. **C. Deng, Y. Zhang, Y. Mao, J. Fan, J. Suo, Z. Zhang, and Q. Dai, IEEE Trans. Pattern Anal. Mach. Intell. **43**, 1380 (2021). [CrossRef]

**6. **Z. Zhang, C. Deng, Y. Liu, X. Yuan, J. Suo, and Q. Dai, Photon. Res. **9**, 2277 (2021). [CrossRef]

**7. **Q. Sun, J. Zhang, X. Dun, B. Ghanem, Y. Peng, and W. Heidrich, ACM Trans. Graph. **39**, 9 (2020). [CrossRef]

**8. **V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, ACM Trans. Graph. **37**, 114 (2018). [CrossRef]

**9. **X. Dun, H. Ikoma, G. Wetzstein, Z. Wang, X. Cheng, and Y. Peng, Optica **7**, 913 (2020). [CrossRef]

**10. **H. Arguello, S. Pinilla, Y. Peng, H. Ikoma, J. Bacca, and G. Wetzstein, Optica **8**, 1424 (2021). [CrossRef]

**11. **J. Chang and G. Wetzstein, “Deep optics for monocular depth estimation and 3D object detection,” in *IEEE/CVF International Conference on Computer Vision (ICCV)* (IEEE, 2019).

**12. **C. A. Metzler, H. Ikoma, Y. Peng, and G. Wetzstein, “Deep optics for single-shot high-dynamic-range imaging,” in *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)* (IEEE, 2020).

**13. **E. J. Candes, J. Romberg, and T. Tao, IEEE Trans. Inf. Theory **52**, 489 (2006). [CrossRef]

**14. **D. L. Donoho, IEEE Trans. Inf. Theory **52**, 1289 (2006). [CrossRef]

**15. **I. J. Goodfellow, Y. Bengio, and A. Courville, *Deep Learning* (MIT, 2016).

**16. **P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, Opt. Express **21**, 10526 (2013). [CrossRef]

**17. **A. A. Wagadarikar, N. P. Pitsianis, X. Sun, and D. J. Brady, Opt. Express **17**, 6368 (2009). [CrossRef]

**18. **L. Gao, J. Liang, C. Li, and L. V. Wang, Nature **516**, 74 (2014). [CrossRef]

**19. **G. Barbastathis, A. Ozcan, and G. Situ, Optica **6**, 921 (2019). [CrossRef]

**20. **X. Yuan, D. J. Brady, and A. K. Katsaggelos, IEEE Signal Process. Mag. **38**, 65 (2021). [CrossRef]

**21. **J. W. Goodman, *Introduction to Fourier Optics* (Roberts & Company, 2005).

**22. **Z. Cheng, R. Lu, Z. Wang, H. Zhang, B. Chen, Z. Meng, and X. Yuan, in *European Conference on Computer Vision (ECCV)* (Springer, 2020), pp. 258–275.