Learning Time-multiplexed phase-coded apertures for snapshot spectral-depth imaging

Open Access

Abstract

Depth and spectral imaging are essential technologies for a myriad of applications but have been conventionally studied as individual problems. Recent efforts have been made to optically encode spectral-depth (SD) information jointly in a single image sensor measurement, subsequently decoded by a computational algorithm. The performance of single snapshot SD imaging systems mainly depends on the optical modulation function, referred to as codification, and the computational methods used to recover the SD information from the coded measurement. The optical modulation has been conventionally realized using coded apertures (CAs), phase masks, prisms or gratings, active illumination, and many others. In this work, we propose an optical modulation (codification) strategy that employs a color-coded aperture (CCA) in conjunction with a time-varying phase-coded aperture and a spatially-varying pixel shutter, thus yielding an effective time-multiplexed coded aperture (TMCA). We show that the proposed TMCA entails a spatially-variant point spread function (PSF) for a constant depth in a scene, which, in turn, facilitates the distinguishability, and therefore, better recovery of the depth information. Further, the selective filtering of specific spectral bands by the CCA encodes relevant spectral information that is disentangled using a reconstruction algorithm. We leverage the advances of deep learning techniques to jointly learn the optical modulation and the computational decoding algorithm in an end-to-end (E2E) framework. We demonstrate via simulations and with a real testbed prototype that the proposed TMCA strategy outperforms state-of-the-art snapshot SD imaging alternatives in both spectral and depth reconstruction quality.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Snapshot imaging techniques have evolved rapidly in recent years, with a primary focus on spectral and depth dimensions. These two quantities are of particular interest since radiation at multiple electromagnetic bands provides information about the object material, while the object location is crucial for 3D-perception tasks and scene understanding [1,2]. The simultaneous acquisition of light reflectance at multiple wavelengths and knowledge of the objects' locations within a scene enable potential applications in areas such as plant phenotyping [3], material analysis under challenging conditions [4], autonomous driving [5], and biomedical imaging [6].

Conventionally, depth and spectral imaging have been studied as individual problems [1,7,8]. Recent approaches have been proposed to acquire these two quantities simultaneously by combining depth imaging approaches, such as stereo-vision or light field imaging, with spectral cameras [4,9,10]. These approaches, however, are limited by the costs of the spectral camera, their large form factor and complexity of alignment and registration. Alternatively, compressive spectral imagers (CSI) have been combined with a microlens array [11], time-of-flight sensors [12], or phase modulators [13] to encode the spectral-depth (SD) information in a single snapshot.

While extremely efficient for long ranges and usually preferred due to their high resolution, depth approaches that employ external light sources, such as structured light illumination, are highly sensitive to environmental light [14] and, thus, their applicability is restricted to indoor scenes. To reduce the form factor, and to alleviate the requirement of an external illumination, authors in [15] proposed a compact snapshot imaging system constituted of just a single diffractive optical element (DOE) and an RGB sensor.

Recovering SD information jointly from a single snapshot is an extremely ill-posed problem. The ability to accurately recover these quantities mainly depends on the properties of the optical modulation and the employed computational algorithm. Inspired by recent work on time-multiplexed coded apertures (TMCA) [16], in this work we propose a new codification strategy to encode SD information in a single snapshot. More precisely, the proposed codification consists of a color-coded aperture (CCA), in conjunction with a time-varying phase-coded aperture, synchronized with a per-pixel coded exposure. The latter refers to the temporal modulation performed by a shutter function where pixels of a sensor do or do not measure light during short periods within the exposure, realizing, in turn, a coded exposure. We show that this combination entails a spatially-variant point spread function (PSF) for the same depth in a scene, which, in turn, facilitates the distinguishability, and therefore, a better recovery of the depth information. Further, the selective transmission of specific spectral bands due to the CCA encodes relevant spectral information that is disentangled using a reconstruction algorithm.

Inspired by the state-of-the-art in deep optics [16–19], we propose to learn the TMCA codification with a neural network, in an end-to-end framework, to recover the spectral image and depth map from a single coded measurement. To this end, we describe the image formation model following wave optics, since it considers the physics-based propagation of light and the physical relationship between the spectrum and depth of the incident wavefront [20]. The phase modulation and the shutter function are optimized considering the constraints of the spatial light modulator used in the implementation (i.e., a deformable mirror) and of the sensor realizing the coded exposures [21]. Once the TMCA is learned via simulations, it is deployed and tested in a proof-of-concept prototype to experimentally validate the proposed approach.

2. Time-multiplexed coded aperture (TMCA)

TMCA was introduced recently as an improved optical modulation strategy for compressive spectral and light-field imaging [16]. Building upon this, here we propose a TMCA codification strategy for SD imaging based on the synchronization of a time-varying phase-coded aperture (phase modulator) with a shutter function, in conjunction with a CCA. Phase-coded apertures provide depth-variant PSFs, useful to encode depth in a single snapshot [18,22,23], while the CCA boosts the flexibility of encoding spectral information [24–26]. Here, we exploit these benefits using TMCA and demonstrate that this combination induces a novel codification strategy for snapshot SD imaging.

2.1 General synchronization model

Consider the irradiance $f$, invariant in time, as the quantity to be reconstructed with our proposed SD system. The proposed TMCA consists of two coding stages. The first stage optically encodes $f$ into a field $g$, incident to the sensor, using a time-varying phase-coded aperture $h(t)$. We model $g(t)$ as the response of a linear optical system $\mathcal {O}(\cdot )$ given $f$:

$$g(t) = \mathcal{O}(h(t),f).$$

The proposed codification generalizes to different optical systems; in general, $\mathcal {O}(\cdot )$ models the physical propagation of light through the optical system and its interaction with the time-varying phase-coded aperture $h(t)$. The specific form of $\mathcal {O}(\cdot )$ used in this work is described in the following section.

The second stage further encodes the coded irradiance $g(t)$, as it is integrated by the sensor in a single snapshot with exposure time $\Delta {t}$, thus resulting in the coded exposure $e(t)$. More precisely, consider the sensor plane has $M \times N$ pixels, with rows indexed by $m=0, \ldots, M-1$ and columns by $n=0, \ldots, N-1$. During the exposure time, a per-pixel shutter function $S_{m,n}(t)$ turns the $(m,n)$-th pixel “on” and “off” multiple times, resulting in the coded exposure $e_{m,n}(t)$:

$$e_{m,n}(t) = \int_{t}^{t+\Delta t} S_{m,n}(t')g_{m,n}(t')\,\mathrm{d} t'.$$

In particular, we consider binary shutter functions $S_{m,n}$ defined on $K$ discrete time slots of time $\delta {t}$, such that, $\Delta {t}=K\cdot \delta {t}$. The coded exposure can then be rewritten as

$$e_{m,n} = \sum_{k=0}^{K-1} S_{m,n}^k g_{m,n}^k, \text{ with } S_{m,n}^k \in \{0,1\},$$
where $g_{m,n}^k$ and $S_{m,n}^k$ denote the discretized irradiance incident on the sensor and the shutter function at the $(m,n)$-th pixel in the $k$-th time slot, respectively. These time slots allow synchronizing the phase-coded apertures with the coded exposures.

Alternatively, we can write the discrete model as $\mathbf {e} = \sum _k \boldsymbol {S}^k\mathbf {g}^k$, where $\mathbf {e}$ and $\mathbf {g}$ are the coded exposure and coded irradiance, in vector form, and $\boldsymbol {S}^k$ is a diagonal matrix representing the shutter function in the $k$-th time slot. Similarly, the encoded irradiance incident on the sensor can be denoted as $\mathbf {g}^k = \boldsymbol {O}^k \mathbf {f}$, where the matrix $\boldsymbol {O}^k$ represents the point spread function (PSF) corresponding to the $k$-th phase-coded aperture, and $\mathbf {f}$ represents the irradiance in vector form. Using this notation, the forward model of the proposed TMCA codification strategy can be succinctly written as

$$\mathbf{e} = \sum_{k=0}^{K-1} \boldsymbol{S}^k\boldsymbol{O}^k \mathbf{f} = \mathbf{A}\mathbf{f},$$
where $\mathbf {A}$ is the measurement matrix of our imaging system. Note that $\mathbf {e}$ relates to the captured image via the camera response function $\mathcal {R}$, which includes noise and quantization effects, i.e., $\mathbf {y}=\mathcal {R}(\mathbf {e})$. Recovering $\mathbf {f}$ from $\mathbf {y}$ amounts to solving an inverse problem whose performance depends on the algorithm employed, and the measurement matrix $\mathbf {A}$. In the following section, we use this forward model where the irradiance $f$ (and $\mathbf {f}$) spans multiple depths and wavelengths.
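As a concrete illustration of the discrete forward model in (4), the following minimal NumPy sketch assembles a toy measurement; the per-slot PSF matrices $\boldsymbol{O}^k$, shutter diagonals $\boldsymbol{S}^k$, and the simple noise-plus-quantization camera response used here are random stand-ins, not the learned system.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 64, 4                                 # number of pixels and time slots (toy sizes)

f = rng.random(N)                            # vectorized scene irradiance (stand-in)
O = [rng.random((N, N)) for _ in range(K)]   # per-slot PSF matrices O^k (stand-ins)
S = [np.diag(rng.integers(0, 2, N)) for _ in range(K)]  # binary per-pixel shutters S^k

# Coded exposure e = sum_k S^k O^k f  (Eq. (4)); A = sum_k S^k O^k is the system matrix
A = sum(Sk @ Ok for Sk, Ok in zip(S, O))
e = A @ f

# The captured image adds a camera response R(e); here a simple noise + 12-bit quantization
y = np.clip(np.round((e / e.max()) * 4095 + rng.normal(0, 2, N)), 0, 4095)
```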

2.2 Continuous forward model

The proposed snapshot SD imaging system is depicted in Fig. 1. It consists of a free-form phase mask, with arbitrary thickness profile $h$, in the aperture plane and a CCA on the sensor plane. The distance separating the aperture plane and the sensor plane is denoted as $s$. The incident irradiance to the system is an SD source density $f(x,y,z,\lambda)$; the system responds differently to every source point depending on the phase-mask height map $h$, depth $z$, and wavelength $\lambda$. Here, we use wave optics to obtain the PSF [20]. To derive the mathematical model, first consider the spherical light wave emitted by a point source at a distance $z$ from the aperture plane. The unit-amplitude complex-valued electric field immediately before the phase mask is given by

$$U_{\mathrm{in}}(x',y') = \mathrm{e}^{ik \sqrt{x'^2+y'^2+z^2}},$$
where $k=2\pi /\lambda$ is the wavenumber. The wave then travels through the phase mask $h$; the field immediately after is obtained by multiplying the impinging field $U_{\mathrm {in}}(x',y')$ by $b(x',y')$, which models the delay at each location $(x',y')$ on the phase mask, induced by its thickness and depending on its refractive index. The phase transformation is given by:
$$b(x',y') = \mathrm{e}^{i\phi_d(x',y')} ,$$
where
$$\phi_d(x',y') = \frac{2\pi \Delta n}{ \lambda} h(x',y'),$$
and $\Delta n$ is the refractive-index difference between air and the material of the refractive or diffractive optical element constituting the free-form phase mask. Additionally, we model an aperture by inserting an amplitude function $A(x', y')$ that modulates the unit-amplitude field and blocks all the light outside its opening. The electric field immediately after the phase mask is then calculated by multiplying both the amplitude modulation (introduced by the aperture) and the phase modulation (introduced by the free-form phase mask) with the impinging electric field:
$$U_{\mathrm{out}}(x',y') = A(x',y')\cdot b(x',y')\cdot U_{\mathrm{in}}(x',y').$$


Fig. 1. Schematic setup of the proposed TMCA codification strategy for snapshot SD imaging. The phase mask varies in time and the sensor includes a shutter function. The CCA encodes the spectral information before being integrated into the sensor.


Finally, the field propagates a distance $s$ from the phase-mask to the focal plane assuming the transfer function [20]:

$$H_s(k_x,k_y)=\mathrm{e}^{i k s\sqrt{1-(\lambda k_x)^2 - (\lambda k_y)^2}},$$
where $(k_x, k_y)$ are the spatial frequencies. Applying this transfer function in the Fourier domain yields the intensity distribution in front of the sensor, which defines the PSF:
$$p(x^{\prime\prime},y^{\prime\prime},z,\lambda) = U_{\text{FPA}}(x^{\prime\prime},y^{\prime\prime}) =\left|\mathcal{F}^{{-}1}\left[\mathcal{F}\left[U_{out}(x',y')\right]H_s(k_x,k_y)\right]\right|^2,$$
where $\mathcal {F}$ denotes the $2$D Fourier transform. The model in (10) yields a 2D PSF for each wavelength ($\lambda$) and depth ($z$) of interest. Based on (10), we can approximate the incident irradiance $g$ as the sum of the contributions of every point source in the scene. If the scene is composed of a planar object located at a given depth $z$, $g$ can be modeled as a simple convolution of the PSF in (10) with $f$ along the sensor dimensions. However, since natural scenes contain depth variations, the incident irradiance is modeled as a spatially-variant convolution. Here, we consider a two-dimensional manifold in 3D space with local intensity; that is, we consider an object parameterized with spatial coordinates $(x,y)$ and depth coordinate $z(x,y)$, resulting in the irradiance $f(x,y,z(x,y),\lambda )$. For simplicity of notation, hereafter we denote the irradiance function as $f(x,y,\lambda )$. Thus, the encoded irradiance $g$ in the sensor plane can be approximated as
$$g(x,y,\lambda) = \iint f(x',y',\lambda)p(x-x',y-y',z(x',y'),\lambda) \,\mathrm{d} x' \,\mathrm{d} y',$$
where $p$ is the spatially-variant PSF of the system that varies with depth $z(x', y')$.
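The PSF model in (5)–(10) can be simulated with a few FFTs. The sketch below implements the angular-spectrum propagation for a given height map, depth, and wavelength; the aperture shape, refractive-index contrast, propagation distance, and sampling pitch are illustrative assumptions rather than the parameters of the actual prototype.

```python
import numpy as np

def psf_at_depth(height_map, z, lam, dn=0.5, s=25e-3, pitch=4e-6):
    """Simulate the intensity PSF of Eq. (10) for a phase mask `height_map` [m], a point
    source at depth z [m], wavelength lam [m], refractive-index contrast dn, mask-to-sensor
    distance s [m], and sample pitch [m]. All numerical values here are illustrative."""
    n = height_map.shape[0]
    k = 2 * np.pi / lam
    x = (np.arange(n) - n // 2) * pitch
    X, Y = np.meshgrid(x, x)

    # Spherical wave from the point source, Eq. (5), and a circular aperture A(x', y')
    U_in = np.exp(1j * k * np.sqrt(X**2 + Y**2 + z**2))
    A = (np.sqrt(X**2 + Y**2) <= x.max() / 2).astype(float)

    # Phase delay of the mask, Eqs. (6)-(7), and field after the mask, Eq. (8)
    U_out = A * np.exp(1j * 2 * np.pi * dn / lam * height_map) * U_in

    # Angular-spectrum propagation over distance s, Eqs. (9)-(10)
    fx = np.fft.fftfreq(n, d=pitch)
    FX, FY = np.meshgrid(fx, fx)
    arg = np.maximum(1 - (lam * FX)**2 - (lam * FY)**2, 0.0)  # suppress evanescent waves
    H = np.exp(1j * k * s * np.sqrt(arg))
    U_fpa = np.fft.ifft2(np.fft.fft2(U_out) * H)
    psf = np.abs(U_fpa)**2
    return psf / psf.sum()

# Example: flat mask (no phase modulation), point source at 1 m, 550 nm light
psf = psf_at_depth(np.zeros((256, 256)), z=1.0, lam=550e-9)
```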

We further consider an approximated layered depth model, with $D$ depth planes, that leads to a finite number of PSFs corresponding to each depth plane. Under this approximation, the spatially variant PSF can be linearly separated as

$$p(x-x',y-y',z(x',y'),\lambda) = \sum_{d=0}^{D-1} \omega_d(x',y') p_d(x-x',y-y',\lambda),$$
where $p_d(x,y,\lambda ) = p(x,y,z_d,\lambda )$ corresponds to the PSF at depth plane $z_d$, and $\omega _d$ are binary weights, such that $\omega _d=1$ if spatial location $(x',y')$ corresponds to the $d$-th depth layer. Note that the linear approximation in (12) can accurately model the response from constant-depth regions, but it fails at depth discontinuities. To improve the accuracy of the forward model, [23] proposed a nonlinear differentiable model based on alpha compositing. For simplicity, we stick to the linear model, but a similar conclusion can be derived for the nonlinear model in [23].

With the approximation in (12), the spatially-variant convolution in (11) simplifies to a sum of spatially-invariant convolutions of the form

$$g(x,y,\lambda) = \sum_{d=0}^{D-1}\iint f(x',y',\lambda)\omega_d(x',y')p_d(x-x',y-y',\lambda) \,\mathrm{d} x' \,\mathrm{d} y'.$$
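A minimal sketch of the layered image formation in (12)–(13) is given below: each depth layer is masked with its binary weights and blurred with the corresponding PSF via FFT-based (circular) convolution. The scene, depth labels, and per-depth PSFs are random stand-ins.

```python
import numpy as np

def layered_blur(f, depth_idx, psfs):
    """Approximate the layered image formation of Eq. (13).
    f:         (H, W, L) spectral irradiance of the scene
    depth_idx: (H, W)    integer depth-layer label in [0, D) for each pixel
    psfs:      (D, H, W, L) one PSF per depth plane and wavelength
    All inputs are stand-ins for the quantities produced by the optical model."""
    D = psfs.shape[0]
    g = np.zeros_like(f)
    for d in range(D):
        w_d = (depth_idx == d).astype(f.dtype)      # binary layer weights omega_d
        for l in range(f.shape[-1]):
            # spatially-invariant (circular) convolution per depth layer and wavelength
            g[..., l] += np.real(np.fft.ifft2(np.fft.fft2(f[..., l] * w_d) *
                                              np.fft.fft2(np.fft.ifftshift(psfs[d, :, :, l]))))
    return g

# Toy usage with random stand-ins: 3 depth layers, 64x64 pixels, 5 spectral bands
rng = np.random.default_rng(0)
g = layered_blur(rng.random((64, 64, 5)), rng.integers(0, 3, (64, 64)), rng.random((3, 64, 64, 5)))
```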

Now considering a time-varying height profile $h$, denoted by $h(t)$, the CCA denoted as $C(x,y,\lambda )$, and the shutter function $S$, the measured coded exposure of the proposed system is given by

$$e(x,y,\lambda) = \int C(x,y,\lambda)S(x,y,t) \sum_{d=0}^{D-1} \iint f(x',y',\lambda)\omega_d(x',y') p(x-x',y-y',z_d,\lambda,t) \,\mathrm{d} x' \,\mathrm{d} y' \,\mathrm{d} t,$$
where $p$ varies in time due to the variations in $h$. After regrouping the time-varying terms, the coded exposure can be succinctly expressed as
$$e(x,y,\lambda) = \iint f(x',y',\lambda)\tilde{p}(x-x',y-y',{z}(x',y'),\lambda) \,\mathrm{d} x' \,\mathrm{d} y',$$
where $\tilde {p}(x-x',y-y',{z}(x',y'),\lambda )$, given by
$$\sum_{d=0}^{D-1} \omega_d(x',y') \int p(x-x',y-y',z_d,\lambda,t)\, C(x+x',y+y',\lambda)S(x+x',y+y',t) \,\mathrm{d} t$$
is the equivalent spatially-variant PSF realized by the TMCA codification. Note that, compared to the traditional PSF in (12), the proposed TMCA entails a different PSF for every position $(x',y')$, for a constant depth plane in the scene. We leverage this characteristic of the TMCA to enhance depth estimation by sacrificing spatial resolution, an advantage that cannot be achieved in a spatially-invariant system. Furthermore, the synchronized time variations of the phase mask and shutter function increase the expressiveness of the equivalent PSFs, thus providing a wider range of possibilities to design the PSF.

2.3 Discrete model

For a given pixel $(m,n)$ in the sensor plane, and considering $K$ discrete time slots, the discrete measured exposure in (15) is given by

$$e_{m,n,\ell} = \sum_i \sum_j f_{i,j,\ell} \tilde{p}_{m-i,n-j,z(i,j),\ell},$$
where
$$\tilde{p}_{m-i,n-j,z(i,j),\ell} = \sum_{d=0}^{D-1} \omega_{d,i,j} \sum_{k=0}^{K-1} p_{m-i,n-j,d,\ell,k} C_{m+i,n+j,\ell}S_{m+i,n+j,k},$$
$i,j$ are indices along the discretized spatial dimensions of $M,N$ samples, $\ell,k$ denote the indices of the discretized wavelength and time dimensions, and $L$ and $K$ are the corresponding numbers of samples along them. According to (18), the number of distinct PSFs depends on the spatial resolution of the shutter and the CCA, thus leading to a considerable number of PSFs and high model complexity. To reduce this number, we constrain the structure of the shutter and CCA to be periodic with period $Q\ll M,N$. In this case, the total number of PSFs reduces from $MNDL$ to $Q^2DL$. Moreover, the CCA filters cannot be arbitrary spectral responses in real-world scenarios. To alleviate this constraint, we adopt the methodology in [27], where the spectral response of the CCA filters is constrained to be a linear combination of $R$ feasible filters. Figure 2(a) shows the depth-and-time-varying PSFs $p$, the shutter function $S$, and the equivalent depth-variant PSF $\tilde {p}$ in (18), for a point source at a given position $(i,j)$ in the image plane. We recall that the proposed TMCA yields a different PSF for every position, for a constant depth plane in the scene, leading to accurate depth estimation by sacrificing spatial resolution. This characteristic is illustrated in Fig. 2(b), which shows the PSFs of $4$ different points in the image plane at the same depth $z_d$. In comparison to depth-from-defocus approaches, which encounter difficulties in textureless regions, the spatially-variant response of the proposed TMCA pattern mitigates this depth ambiguity.
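The following simplified sketch assembles the $Q^2$ distinct equivalent PSFs of (18) for one depth plane by gating the per-slot PSFs with the periodic shutter and weighting the spectral bands with the local CCA filter; for brevity it applies the shutter and CCA values of the reference pixel of each periodic cell and omits the spatial offsets of the exact model. All arrays are random stand-ins.

```python
import numpy as np

def equivalent_psfs(psfs, shutter, cca):
    """Compute Q*Q equivalent PSFs (a simplified form of Eq. (18)) for one depth plane.
    psfs:    (K, P, P, L) time-varying PSFs for that depth (P x P support, L bands)
    shutter: (Q, Q, K)    Q-periodic binary per-pixel exposure pattern S
    cca:     (Q, Q, L)    Q-periodic color-coded aperture transmittance C
    Returns an array of shape (Q, Q, P, P, L): one PSF per position in the periodic cell."""
    Q = shutter.shape[0]
    P, L = psfs.shape[1], psfs.shape[-1]
    out = np.zeros((Q, Q, P, P, L))
    for qi in range(Q):
        for qj in range(Q):
            for k in range(psfs.shape[0]):
                # the shutter gates which time slots this cell position integrates,
                # and the local CCA filter weights each spectral band of the PSF
                out[qi, qj] += shutter[qi, qj, k] * psfs[k] * cca[qi, qj][None, None, :]
    return out

# Toy usage: K = 4 slots, 15x15 PSF support, L = 25 bands, Q = 4 periodic cell
rng = np.random.default_rng(1)
tp = equivalent_psfs(rng.random((4, 15, 15, 25)),
                     rng.integers(0, 2, (4, 4, 4)).astype(float),
                     rng.random((4, 4, 25)))
```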


Fig. 2. (a) Depth-and-time varying PSFs $p$, shutter function $S$, and equivalent depth-variant PSF $\tilde {p}$ of a point source from a given position $(i,j)$ in the image plane. (b) PSFs of different points at the same depth $z_d$.


Further, in an imaging system, the recorded image in a channel $c$ is an integration of the spectral information weighted by the spectral response of the $c$-th sensor channel. Considering a classical image formation model, the recorded image per channel, $y^c$, can be finally expressed as

$$y_{m,n}^c = \sum_\ell e_{m,n,\ell} \kappa^c_{\ell} + w_{m,n},$$
where $\kappa ^c_{\ell }$ represents the $c$-th spectral response of the sensor, at the $\ell$-th spectral band, and $w_{m,n}$ accounts for additive Gaussian noise.
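As a small illustration of (19), the sketch below integrates a coded spectral exposure against assumed sensor channel responses and adds Gaussian read noise; the array shapes and response curves are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, L, C = 256, 256, 25, 3          # sensor size, spectral bands, color channels (toy sizes)

e = rng.random((M, N, L))             # coded spectral exposure, as in Eq. (17)
kappa = rng.random((L, C))            # sensor spectral responses kappa^c_l (stand-in curves)

# Eq. (19): per-channel spectral integration plus additive Gaussian noise w
y = np.einsum('mnl,lc->mnc', e, kappa) + rng.normal(0, 0.01, (M, N, 1))
```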

3. End-to-end optimization: joint learning of the TMCA and recovery algorithm

Deep optics introduces the concept of optimizing the optical elements jointly with the reconstruction algorithm in an end-to-end framework, employing stochastic optimization [17,19]. Based on differentiable reconstruction algorithms, this framework allows the optimization of domain-specific computational cameras. Particularly, it has recently been investigated in applications such as extended depth of field [17], high dynamic range imaging [28], spectral imaging [29], depth estimation [18], and many others.

We leverage the deep optics framework to learn our TMCA, as summarized in Fig. 3. Formally, let $\mathbf {f}$ and $\mathbf {z}$ denote the spectral image and the depth map, in vector form, respectively. Denote the optical imaging system by the operator $\mathcal {A}_\varphi$, where $\varphi$ represents the optical parameters to be optimized ($h$, $S$, and $C$) in the proposed SD system. We parameterize $h$ using the basis of Zernike polynomials [30]. The computation of the measurements in (19) can be simply represented as $\mathbf {y}=\mathcal {A}_\varphi (\mathbf {f},\mathbf {z})$. To estimate $\mathbf {f}$ and $\mathbf {z}$ from the coded measurements $\mathbf {y}$, a differentiable neural network (decoder) $\mathcal {D}_\psi$ is used. Similarly to the optical encoder, we assume that the decoder can be fully defined by the set of parameters $\psi$. Given pre-trained encoder–decoder models, one would capture measurements with the corresponding optical system $\mathbf {y} = \mathcal {A}_\varphi \left (\mathbf {f}, \mathbf {z}\right )$ and then recover the depth and spectral image via $\{\mathbf {{\tilde f}}, \mathbf {{\tilde z}}\}= \mathcal {D}_\psi \left (\mathbf {y}\right )$. The recovery process is commonly referred to as the inference stage. To determine the optimal set of parameters $\varphi,\psi$, the optical encoder and electronic decoder are trained jointly in an end-to-end fashion, using a dataset with $P$ ground-truth spectral-and-depth image pairs, by minimizing a loss function $\mathcal {L}$

$$\underset{\varphi,\psi}{\text{argmin}} \sum_{i=0}^{P-1} \mathcal{L}\left( \mathcal{D}_\psi \circ \mathcal{A}_\varphi ( \mathbf{f}^i, \mathbf{z}^i ); \mathbf{f}^i, \mathbf{z}^i\right),$$
where $\circ$ denotes the composition of functions and
$$\mathcal{L}\left( \mathbf{{\tilde f}}, \mathbf{{\tilde z}}; \mathbf{f}, \mathbf{z}\right) = \|\mathbf{{\tilde z}} - \mathbf{z}\|_1 + \| \mathbf{{\tilde f}} - \mathbf{f}\|_2 + \| \nabla\mathbf{{\tilde z}}\|_1 .$$

The first two terms in (21) minimize the discrepancy between the reconstructed and the ground truth images, while the last term is a total variation regularization that aims to capture the sharp boundaries of the depth maps, with $\nabla$ being the spatial gradient operator. Note that since both the encoder and decoder are jointly optimized, the optical coding parameterized by $\varphi$ influences the reconstruction algorithm parameters $\psi$ and vice versa. For our SD reconstruction, we use two vanilla U-Nets for depth and spectral recovery, as shown in Fig. 4. We take advantage of the spatially-variant response for a constant depth plane to improve depth estimation by sacrificing spatial resolution. To this end, assuming a periodic structure of the shutter, and a CCA with period $Q$, we rearrange the mosaic patterned measurements to obtain an invariant low-resolution (LR) cube with $Q^2$ depth-channels. This rearrangement is performed through a space-to-depth module (S2D) before the U-Net used for depth estimation. Processing the measurements at lower resolution leads to reduced computational complexity, thereby resulting in an efficient reconstruction algorithm. At the end of the U-Net for depth imaging, we upscale the $Q^2$ low-resolution features into a high-resolution depth map using a depth-to-space (D2S) module. The D2S module is inspired by the sub-pixel convolution layer proposed in [31] for super-resolution imaging.
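A PyTorch sketch of the training objective in (21) and of the S2D/D2S rearrangements around the depth U-Net is shown below. The U-Nets, the differentiable optical encoder, and the unit weighting of the loss terms are assumptions of this sketch, not the exact configuration used in the paper.

```python
import torch

def sd_loss(f_hat, z_hat, f, z):
    """Sketch of the loss in Eq. (21): L1 depth error + L2 spectral error + TV on depth.
    The relative weighting of the three terms (all set to 1 here) is an assumption."""
    l_depth = (z_hat - z).abs().mean()                 # ||z_hat - z||_1 (mean-normalized)
    l_spec = torch.sqrt(((f_hat - f) ** 2).mean())     # ||f_hat - f||_2 (mean-normalized)
    l_tv = (z_hat[..., :, 1:] - z_hat[..., :, :-1]).abs().mean() + \
           (z_hat[..., 1:, :] - z_hat[..., :-1, :]).abs().mean()   # ||grad z_hat||_1
    return l_depth + l_spec + l_tv

Q = 4
space_to_depth = torch.nn.PixelUnshuffle(Q)   # (B, C, H, W)       -> (B, C*Q^2, H/Q, W/Q)
depth_to_space = torch.nn.PixelShuffle(Q)     # (B, C*Q^2, H/Q, W/Q) -> (B, C, H, W)

# One hypothetical training step, given a differentiable optical encoder `A_phi`
# and the two U-Nets `depth_unet` / `spectral_unet` (none of these are defined here):
#   y     = A_phi(f, z)                                     # simulated coded snapshot
#   z_hat = depth_to_space(depth_unet(space_to_depth(y)))   # LR depth branch + D2S
#   f_hat = spectral_unet(y)                                # spectral branch at full resolution
#   loss  = sd_loss(f_hat, z_hat, f, z); loss.backward(); optimizer.step()
```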


Fig. 3. Overview of the proposed end-to-end time-multiplexed phase-coded aperture (TMCA) pipeline. A sequence of phase masks is synchronized, in time, with a coded shutter to generate the proposed spatially-variant PSFs. Then, we simulate the coded snapshot $\mathbf {e}$ and feed it into the deep network $\mathcal {D}_{\psi }$ that estimates the spectral-depth image. The TMCA and deep network parameters ($\varphi$, $\psi$) are optimized via backpropagation.



Fig. 4. Sketch of the digital decoder. (Left) Neural network architecture for depth map estimation and spectral image reconstruction. (Right) Space-to-depth (S2D) and depth-to-space modules (D2S). The S2D module rearranges the HR sensor measurement as a LR tensor with $Q^2$ channels, and the D2S module rearranges back the LR features as a HR depth estimate.


4. Results

We show the results of the proposed TMCA in simulations and on real acquisitions using a testbed lab prototype.

4.1 Dataset and training details

One important aspect of a learning model is the dataset used to supervise the optimization. Even though there is a plethora of spectral and depth map datasets, there exist very limited SD datasets. In this work, we use a recent benchmark SD dataset proposed in [15]. It consists of $18$ spectral-and-depth map image pairs with a spatial resolution of $2824 \times 4240$ pixels. The spectral images span $25$ wavelengths from $440$ nm to $680$ nm, with a $10$ nm interval. The depth values vary within $[0.4, 2.0]$ m. We randomly select and use $14$ spectral images for training, and $4$ for testing. At training time, we randomly select patches of size $256 \times 256$. At testing time, we use the learned model on non-overlapping patches of size $256 \times 256$ to construct the whole image. We crop the SD image to a spatial resolution of $2816\times 4096$ to obtain an integer number of patches. We set the number of time slots in the TMCA encoder to $K=4$, the number of depths to $D=8$, and the number of feasible color filters to $R=4$. The neural network is trained for $1000$ epochs with a learning rate of $0.001$. Our models are implemented in PyTorch [32] and trained on a Titan RTX GPU using the ADAM optimizer [33].
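For reference, the patch handling described above can be sketched as follows; `random_patch` draws aligned $256\times 256$ training crops and `tile_patches` splits a cropped test image into non-overlapping patches. Shapes and helper names are ours, not part of any released code.

```python
import torch

def random_patch(spec, depth, size=256):
    """Sample one aligned training crop from a spectral cube (L, H, W) and depth map (H, W)."""
    _, H, W = spec.shape
    i = torch.randint(0, H - size + 1, (1,)).item()
    j = torch.randint(0, W - size + 1, (1,)).item()
    return spec[:, i:i + size, j:j + size], depth[i:i + size, j:j + size]

def tile_patches(spec, size=256):
    """Split a cropped (L, 2816, 4096) test image into non-overlapping size x size patches;
    returns a tensor of shape (H/size, W/size, L, size, size)."""
    L, H, W = spec.shape
    return spec.reshape(L, H // size, size, W // size, size).permute(1, 3, 0, 2, 4)
```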

4.2 Simulation results

We compare the proposed TMCA codification method against two baseline approaches: one that relies solely on DOE-based codification [15], and another that employs a singlet lens. To obtain the results of [15], we modified our own implementation by removing the time-varying elements, the coded shutter, and the CCA. To obtain the results of the second baseline, we also remove the shutter and CCA and set the phase in the aperture to that of a singlet lens. Note that for the singlet lens baseline, the optical encoder does not have trainable parameters. As the digital decoder of both baselines, we use the same two vanilla U-Nets of the proposed approach but without the S2D and D2S modules. We evaluate the quality of the spectral reconstruction using the peak signal-to-noise ratio (PSNR), the spectral angle mapper (SAM) metric [34], and the universal image quality index (UIQI) [35]; the depth map reconstruction is evaluated using the root-mean-square error (RMSE) and the mean absolute error (MAE). Quantitative results evaluated on the test set are presented in Table 1, showing that the proposed TMCA codification performs slightly better than the baseline in terms of spectral quality, while showing superior performance in depth estimation. Qualitative results of $3$ recovered spectral images, depicted in Fig. 5, show that the proposed method can recover better spectral information (better color matching to the ground truth), while attaining a similar spatial resolution. The corresponding estimated depth maps from the scenes used in Fig. 5 are shown in Fig. 6, which demonstrates that the TMCA yields depth maps that attain lower spatial resolution but feature higher fidelity to the ground truth depth maps. Recovered spectral signatures of 3 different colors in the color checker, shown in Fig. 7, confirm the better recovery of the spectral information. Finally, since we use the same U-Nets as the digital decoder for both baselines, the quantitative results of Table 1 confirm that the gains cannot be attributed solely to the neural network but indeed are influenced by the optical coding.


Fig. 5. Comparison of the recovered spectral images, mapped to RGB, using our proposed TMCA method and the competitive baselines. Visually, the colors of the proposed approach better resemble the ground truth, demonstrating a better spectral estimation. The PSNR between the recovered and the ground truth images is reported in the lower-right corner of each inset.



Fig. 6. Comparison of the estimated depth maps using the proposed TMCA method and the competitive baselines. The estimated depth maps by our proposed approach attain lower spatial resolution but exhibit higher fidelity to the ground truth depth values. The MAE between the estimated and ground truth depth maps is reported in the lower-right corner of each inset.



Fig. 7. (a) Comparison of the reconstructed spectral signatures of 3 color patches from the color checker in Fig. 5, between the proposed TMCA and the competitive method [36]. The SAM metric between the estimated and ground truth signatures is reported in the upper-right corner of each inset. (b) Spectral response of the feasible color filters used in the CCA [27]. (c) Spectral response of the $Q^2$ learned filters (with $Q=4$). (d) Spatial distribution of the $Q\times Q$ learned CCA.


4.3 Impact of noise and period $Q$

In this experiment, we analyze more realistic scenarios where the coded measurement may be affected by different levels of noise. Particularly, we added Gaussian noise to the measurements with SNR values of $20$, $30$, and $40$ dB. The performance of the proposed approach is reported in Table 2 which shows a degradation in the recovered image quality for increasing noise levels, as expected. However, the method still recovers the spectral information with a PSNR above 30 dB, and the depth information with a similar RMSE for exceptionally noisy scenarios (20 dB). Furthermore, we evaluate the proposed approach for different designs with $Q$ values of 2, 4 and 8. The results reported in Table 3 show a decline in performance when $Q = 8$. This degradation occurs since increasing the value of $Q$ leads to the processing of very low-resolution images, which yield poor quality when reverted to the target resolution. This case can be related to a challenging interpolation problem with high downsampling ratios. We found that $Q=4$ provides a balance between spectral recovery precision and the loss of resolution for depth estimation. Nevertheless, it is important to note that the proposed approach consistently outperforms the competitive methods in the depth estimation task for all selected values of $Q$.
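The noise levels in Table 2 correspond to additive white Gaussian noise at a prescribed SNR; a simple way to generate such measurements is sketched below (the function name and array shapes are illustrative).

```python
import numpy as np

def add_noise_snr(y, snr_db, rng=np.random.default_rng(3)):
    """Add white Gaussian noise to a measurement y at a prescribed SNR (in dB),
    matching the 20/30/40 dB settings considered in Table 2."""
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return y + rng.normal(0.0, np.sqrt(noise_power), y.shape)

# Example: corrupt a (stand-in) coded snapshot at the harshest setting considered (20 dB)
y_noisy = add_noise_snr(np.random.rand(256, 256, 3), snr_db=20)
```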

Table 1. Spectral and depth imaging performance in simulation. Comparison of the proposed TMCA method against the baseline [15].

Table 2. Performance of the proposed TMCA method in recovering spectral and depth images for different noisy scenarios.

Table 3. Spectral and depth imaging performance in simulation. Comparison of the proposed TMCA method for different periodic values.

4.4 Real experimentation

We built a proof-of-concept prototype to demonstrate the proposed TMCA for real-world scenes. The prototype, shown in Fig. 8, consists of a reference objective lens (Canon EF 28-80 mm), which focuses the light of the scene at a distance of 40 mm. Employing a pair of 100 mm focal-length lenses (achromatic doublets, f = 100 mm, Thorlabs AC254-100-A-ML) and a beam splitter (BS) (Thorlabs CCM1-BS013, 30 mm, non-polarizing), we generate a 4F system and modulate the wavefront at 2F using a deformable mirror (DM, piezo actuator, Thorlabs DMP40-P01-40) with a spectral range of 450-2000 nm. The phase modulation of the DM is set by the learned Zernike coefficients, which are configured using the official Thorlabs software. Finally, through the BS and the relay lens, an image is formed on the sensor of a Canon EOS M50 camera. We emulate the CCA, shutter function, and time integration. To emulate the CCA, we illuminate the target scene using a tunable light source (Newport TLS130B) and obtain the spectral band intensity with the gray-scale version of the camera. To emulate the shutter function, the integration time is split into $K$ time slots; for each time slot, we load the $k$-th phase mask in the DM and scan the light source along $25$ spectral channels, thus obtaining $K$ spectral cubes with 25 channels each. Then, we point-wise multiply each of the $K$ spectral cubes by the binary pattern representing the corresponding shutter time slot, followed by a per-pixel, per-wavelength filtering step according to the CCA spatial distribution. The resulting $K$ filtered cubes are then added together and spectrally collapsed, thus resulting in our acquired coded snapshot. Once the system is implemented, the performance of the proposed method may be affected by mismatch and modeling errors of the phase modulator. Therefore, we capture the real time-varying PSFs using a white-light point source and calibrate the simulated PSFs with the captured PSFs (see Fig. 9). Then, with the calibrated PSFs, we fine-tune the U-Net networks for 100 epochs, using a learning rate of $10^{-5}$ and the same SD dataset employed in our simulations. Finally, the spectral-and-depth image pair is reconstructed by the fine-tuned networks, using the emulated coded snapshot as input. A summary of the (emulated) coded snapshot, the corresponding recovered spectral image, and the estimated depth map is shown in Fig. 8. This result demonstrates that the proposed approach can recover SD information with high quality in a real-world scenario.
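The emulation procedure above can be summarized by the following sketch, which gates each scanned spectral cube with its shutter slot, applies the CCA transmittance, sums over time, and collapses the spectral dimension; all arrays and the panchromatic collapse weights are stand-ins for the captured and learned quantities.

```python
import numpy as np

def emulate_snapshot(cubes, shutter, cca, kappa):
    """Assemble an emulated coded snapshot as described in Sec. 4.4.
    cubes:   (K, H, W, L) gray-scale spectral cubes, one per DM phase mask / time slot
    shutter: (K, H, W)    binary shutter pattern for each time slot
    cca:     (H, W, L)    per-pixel spectral transmittance of the (emulated) CCA
    kappa:   (L,)         spectral collapse weights (assumed panchromatic response)"""
    gated = cubes * shutter[..., None] * cca[None]   # per-slot shutter gating + CCA filtering
    e = gated.sum(axis=0)                            # add the K filtered cubes (time integration)
    return e @ kappa                                 # collapse the spectral dimension

# Toy usage with random stand-ins: K = 4 slots, 128x128 pixels, 25 spectral channels
rng = np.random.default_rng(4)
y = emulate_snapshot(rng.random((4, 128, 128, 25)),
                     rng.integers(0, 2, (4, 128, 128)).astype(float),
                     rng.random((128, 128, 25)),
                     np.ones(25) / 25)
```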


Fig. 8. Proof-of-concept testbed prototype of the proposed TMCA snapshot SD system, along with a real coded acquisition and the corresponding recovered spectral-and-depth information using the optimized decoder. We compare the estimated reflectance of three spectral signatures at points $P_1$, $P_2$, and $P_3$ with reference signatures measured with a spectrometer and report the SAM metric in the lower-left corner of the inset.



Fig. 9. (a) Simulated (left) and captured (right) PSFs for 5 different depths (columns) and 5 different wavelengths (rows) in the first time slot ($k=0$). (b) Simulated (left) and emulated (right) equivalent TMCA PSF for a given position of the scene.


4.5 Limitations and challenges

The proposed TMCA codification strategy provides improved reconstruction quality, but it has low light efficiency since it relies on a CCA and a pixel shutter. Additionally, we assume a static scene and simulate the synchronization, which may pose challenges in real-time applications. In our real experimentation, we emulate the shutter function and CCA; however, these functions can be implemented using focal-plane sensor-processors [37] and Ximea xiSpec camera sensors, respectively. Moreover, even though we consider a CCA with just $Q^2 =16$ filters, its fabrication cost may limit its widespread applicability; thus, more affordable solutions that use photographic film [27] can be explored, albeit at the expense of limited spatial resolution.

5. Conclusion

We introduced a time-multiplexed optical modulation strategy for spectral-depth imaging that uses a color-coded aperture in conjunction with a shutter synchronized with a time-varying phase-coded aperture. We demonstrated that the proposed time-multiplexed coded aperture (TMCA) entails a spatially-variant response for a constant depth in a scene, which is exploited to enhance depth estimation by sacrificing spatial resolution. The TMCA was learned in an end-to-end framework, and simulations demonstrated comparable performance for spectral image reconstruction and superior fidelity for depth estimation, compared with a baseline approach that learns a single DOE phase mask. Furthermore, we demonstrated the performance of the proposed TMCA codification with a proof-of-concept implementation that allowed us to recover SD information, with high fidelity, from a single coded exposure.

Funding

Instituto Colombiano de Crédito Educativo y Estudios Técnicos en el Exterior (2022-0716); Ministerio de Ciencia, Tecnología e Innovación (MINCIENCIAS) (2022-0716).

Acknowledgments

This work was supported by ICETEX and MINCIENCIAS through the CTO 2022-0716, Sistema óptico-computacional tipo pushbroom en el rango visible e infrarrojo cercano (VNIR), para la clasificación de frutos cítricos sobre bandas transportadoras mediante aprendizaje profundo, desarrollado en alianza con citricultores de Santander, under Grant 8284.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data may be obtained from the authors upon reasonable request.

References

1. G. A. Shaw and H. K. Burke, “Spectral imaging for remote sensing,” Lincoln Lab. J. 14(1), 3–28 (2003).

2. S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2015), pp. 567–576.

3. H. Liu, B. Bruning, T. Garnett, and B. Berger, “Hyperspectral imaging and 3d technologies for plant phenotyping: From satellite to close-range sensing,” Comput. Electron. Agric. 175, 105621 (2020). [CrossRef]  

4. J. Wu, B. Xiong, X. Lin, J. He, J. Suo, and Q. Dai, “Snapshot hyperspectral volumetric microscopy,” Sci. Rep. 6(1), 1–10 (2016). [CrossRef]  

5. M. Hansard, S. Lee, O. Choi, and R. P. Horaud, Time-of-flight cameras: principles, methods and applications (Springer Science & Business Media, 2012).

6. A. J. Radosevich, M. B. Bouchard, S. A. Burgess, R. Stolper, B. Chen, and E. M. Hillman, “Hyperspectral in-vivo two-photon microscopy of intrinsic fluorophores,” in Biomedical Optics, (Optica Publishing Group, 2008), p. BWG7.

7. N. Hagen and M. W. Kudenov, “Review of snapshot spectral imaging technologies,” Opt. Eng. 52(9), 090901 (2013). [CrossRef]  

8. F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,” IEEE Trans. Pattern Anal. Mach. Intell. 38(10), 2024–2039 (2015). [CrossRef]  

9. M. Yao, Z. Xiong, L. Wang, D. Liu, and X. Chen, “Spectral-depth imaging with deep learning based reconstruction,” Opt. Express 27(26), 38312–38325 (2019). [CrossRef]  

10. L. Wang, Z. Xiong, G. Shi, W. Zeng, and F. Wu, “Simultaneous depth and spectral imaging with a cross-modal stereo system,” IEEE Trans. Circuits Syst. Video Technol. 28(3), 812–817 (2016). [CrossRef]  

11. W. Feng, H. Rueda, C. Fu, G. R. Arce, W. He, and Q. Chen, “3d compressive spectral integral imaging,” Opt. Express 24(22), 24859–24871 (2016). [CrossRef]  

12. H. Rueda-Chacon, J. F. Florez-Ospina, D. L. Lau, and G. R. Arce, “Snapshot compressive tof+ spectral imaging via optimized color-coded apertures,” IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2346–2360 (2019). [CrossRef]  

13. M. Marquez, P. Meza, F. Rojas, H. Arguello, and E. Vera, “Snapshot compressive spectral depth imaging from coded aberrations,” Opt. Express 29(6), 8142–8159 (2021). [CrossRef]  

14. C. Li, Y. Monno, and M. Okutomi, “Deep hyperspectral-depth reconstruction using single color-dot projection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 19770–19779.

15. S.-H. Baek, H. Ikoma, D. S. Jeon, Y. Li, W. Heidrich, G. Wetzstein, and M. H. Kim, “Single-shot hyperspectral-depth imaging with learned diffractive optics,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 2651–2660.

16. E. Vargas, J. N. Martel, G. Wetzstein, and H. Arguello, “Time-multiplexed coded aperture imaging: Learned coded aperture and pixel exposures for compressive imaging systems,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 2692–2702.

17. V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Trans. Graph. 37(4), 1–13 (2018). [CrossRef]  

18. J. Chang and G. Wetzstein, “Deep optics for monocular depth estimation and 3d object detection,” in Proceedings of the IEEE International Conference on Computer Vision, (2019), pp. 10193–10202.

19. H. Arguello, J. Bacca, H. Kariyawasam, et al., “Deep optical coding design in computational imaging: a data-driven framework,” IEEE Signal Process. Mag. 40(2), 75–88 (2023). [CrossRef]  

20. J. W. Goodman, Introduction to Fourier optics (Roberts and Company Publishers, 2005).

21. J. N. Martel, L. Mueller, S. J. Carey, P. Dudek, and G. Wetzstein, “Neural sensors: Learning pixel exposures for hdr imaging and video compressive sensing with programmable sensors,” IEEE Trans. Pattern Anal. Mach. Intell. 42(7), 1642–1653 (2020). [CrossRef]  

22. Y. Wu, V. Boominathan, H. Chen, A. Sankaranarayanan, and A. Veeraraghavan, “Phasecam3d–learning phase masks for passive single view depth estimation,” in 2019 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2019), pp. 1–12.

23. H. Ikoma, C. M. Nguyen, C. A. Metzler, Y. Peng, and G. Wetzstein, “Depth from defocus with learned optics for imaging and occlusion-aware depth estimation,” in 2021 IEEE International Conference on Computational Photography (ICCP), (IEEE, 2021), pp. 1–12.

24. H. Arguello and G. R. Arce, “Colored coded aperture design by concentration of measure in compressive spectral imaging,” IEEE Trans. on Image Process. 23(4), 1896–1908 (2014). [CrossRef]  

25. C. V. Correa, H. Arguello, and G. R. Arce, “Snapshot colored compressive spectral imager,” J. Opt. Soc. Am. A 32(10), 1754–1763 (2015). [CrossRef]  

26. L. Huang, R. Luo, X. Liu, and X. Hao, “Spectral imaging with deep learning,” Light: Sci. Appl. 11(1), 61 (2022). [CrossRef]  

27. H. Arguello, S. Pinilla, Y. Peng, H. Ikoma, J. Bacca, and G. Wetzstein, “Shift-variant color-coded diffractive spectral imaging system,” Optica 8(11), 1424–1434 (2021). [CrossRef]  

28. C. A. Metzler, H. Ikoma, Y. Peng, and G. Wetzstein, “Deep optics for single-shot high-dynamic-range imaging,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 1375–1385.

29. L. Wang, T. Zhang, Y. Fu, and H. Huang, “Hyperreconnet: Joint coded aperture optimization and image reconstruction for compressive hyperspectral imaging,” IEEE Trans. on Image Process. 28(5), 2257–2270 (2018). [CrossRef]  

30. R. J. Noll, “Zernike polynomials and atmospheric turbulence,” J. Opt. Soc. Am. 66(3), 207–211 (1976). [CrossRef]  

31. W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 1874–1883.

32. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, (2017).

33. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, arXiv:1412.6980 (2014). [CrossRef]  

34. F. A. Kruse, A. Lefkoff, J. Boardman, K. Heidebrecht, A. Shapiro, P. Barloon, and A. Goetz, “The spectral image processing system (sips)–interactive visualization and analysis of imaging spectrometer data,” Remote. Sensing Environment 44(2-3), 145–163 (1993). [CrossRef]  

35. Z. Wang and A. C. Bovik, “A universal image quality index,” IEEE Signal Process. Lett. 9(3), 81–84 (2002). [CrossRef]  

36. S.-H. Baek, I. Kim, D. Gutierrez, and M. H. Kim, “Compact single-shot hyperspectral imaging using a prism,” ACM Trans. Graph. 36(6), 1–12 (2017). [CrossRef]  

37. Á. Zarándy, Focal-plane sensor-processor chips (Springer Science & Business Media, 2011).
