Depth-assisted calibration on learning-based factorization for a compressive light field display

Open Access

Abstract

Owing to the widespread application of high-dimensional representations in many fields, three-dimensional (3D) display techniques are increasingly used commercially for holographic-like, immersive demonstrations. However, the visual discomfort and fatigue caused by 3D head-mounted devices limit their use in marketing scenarios. The compressive light field (CLF) display provides binocular and motion parallaxes by stacking multiple liquid crystal screens, without any extra accessories. It leverages optical viewpoint fusion to bring an immersive and visually pleasing experience to viewers. Unfortunately, its practical application has been limited by processing complexity and reconstruction performance. In this paper, we propose a dual-guided learning-based factorization for the polarization-based CLF display with depth-assisted calibration (DAC), which substantially improves the visual performance of factorization while running in real time. Specifically, we first adopt a dual-guided network structure constrained by both the reconstructed and the viewing images. In addition, the proposed DAC distributes each pixel across the displayed screens according to the real scene depth. Furthermore, a Gauss-distribution-based weighting (GDBW) concentrates reconstruction quality around the observer's angular position, improving subjective performance. Experimental results illustrate the improved performance in qualitative and quantitative aspects over other competitive methods. A CLF prototype is assembled to verify the practicality of our factorization.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Owing to the widespread application of high-dimensional representations, e.g., light fields and point clouds, three-dimensional (3D) displays have become an area of increased focus alongside the development of stereoscopic sensing technologies. Unfortunately, the vergence–accommodation conflict (VAC) [1] caused by binocular disparity significantly impacts the viewing experience of 3D displays, often causing visual confusion and eye fatigue [1]. To mitigate the VAC issue, non-wearable 3D displays have been studied, such as volumetric [2] and holographic displays [3–5] and integral imaging [6–8], which provide binocular and motion parallaxes when reconstructing 3D scenes. However, the former suffers from speckle noise and great computational demand, while the latter loses substantial spatial resolution to encode the angular dimension. The compressive light field (CLF) display has therefore been introduced to relieve these problems of glasses-free devices [9–12].

The CLF display uses multi-layer spatial light modulators (SLMs) to show images decomposed from a light field scene by non-negative tensor factorization (NTF) [13], as shown in Fig. 1 (left). In essence, it fuses and assigns the scene to the corresponding SLM layers based on its depth cues. The entire process can be viewed as information compression from a dense spatio-angular space to sparse spatial planes. Some content is inevitably discarded, which can cause serious quality degradation in the reconstruction. Our core purpose in this study is therefore to maximize the utilization of the limited display space. Additionally, the decomposition algorithm, NTF, cannot run in real time because it relies on iterative optimization.

Fig. 1. (Left) Diagram of the three-layer polarization-based CLF display. (Right) Light field factorization and viewpoint fusion. Note that light beams do not propagate along the polyline, so viewpoint fusion can be observed only along the straight lines.

To address these defects, many researchers have studied active calibration based on depth [14] or salience maps [15,16]. As an alternative to the conventional iterative method, learning-based schemes driven by the subjective and objective quality of scene reconstruction have been used to speed up decomposition [17]. Other works [18,19] simplify the initial input of factorization while maintaining reconstruction performance. However, none of these methods improves light field decomposition comprehensively over NTF, which limits the commercial application of CLF displays.

In this paper, we propose a learning-based factorization for the polarization-based CLF display. To the best of our knowledge, our scheme is the first to surpass NTF in reconstruction performance and computational efficiency simultaneously. Our main contributions are:

  • We deploy a dual-guided network structure. It first learns the displayed layer images produced by NTF as a fundamental result, and is then guided by the distortion between the light field images and the reconstructed result to refine reconstruction performance pixel-wise.
  • We leverage the relationship between the objects' depth range and the reconstruction quality by initializing the data via depth-assisted calibration (DAC), so that the depth distribution reproduced on the display closely follows the real objects' locations.
  • We introduce a Gauss-distribution-based weighting (GDBW), which significantly improves the overall perceptual performance compared with the baseline. For varied observing angular positions, our scheme accurately restores the viewpoints within the corresponding central area.

The rest of this paper is organized as follows. In Section 2, related works on CLF display and NTF algorithm are discussed. Next, Section 3 thoroughly covers the details of the proposed framework. Experimental results are presented in Section 4. Finally, the conclusion is summarized in Section 5.

2. Related works

2.1 Compressive light field (CLF) display

The CLF display, also known as the tensor display, includes polarization-based and attenuation-based models [10,12]. Because of its improved optical efficiency [12], we concentrate on refining the polarization-based model. Figure 1 (right) illustrates the structure of the polarization-based CLF display. Uniform light rays generated by the backlight undergo diffuse reflection and then intersect each display layer. The rotation angles a light ray accumulates across the layers are summed into an overall shift. The corresponding model is

$$I_{out} = I_{in} \cdot \sin^2(\Phi_a + \Phi_b + \Phi_c),$$
where $\Phi_a$, $\Phi_b$, and $\Phi_c$ are the rotation angles introduced by the three layers, and $I_{out}$ and $I_{in}$ denote the intensities of the emitted and incident polarized light, respectively. For a given viewing direction, the pixels located at the ray's intersection with each layer are fused. Note that only views along straight propagation paths of the light beams can be observed.
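
As a minimal illustration of this fusion model, the NumPy sketch below synthesizes one viewpoint from three layers of per-pixel rotation angles. The integer-shift shear, the `shift` step, and the array names are our own simplifying assumptions for illustration; they do not reproduce the exact display geometry.

```python
import numpy as np

def fuse_viewpoint(phi_front, phi_mid, phi_rear, view_u, view_v, shift=1, I_in=1.0):
    """Fuse three polarization-rotating layers into one viewpoint image.

    phi_*          : 2-D arrays of per-pixel rotation angles (radians) on each layer.
    view_u, view_v : integer angular indices of the desired viewpoint; the outer
                     layers are sheared by one step per index relative to the middle.
    shift          : pixels of shear per angular step (depends on layer spacing and pixel pitch).
    """
    # Shear the outer layers relative to the (fixed) middle layer for this view.
    front = np.roll(phi_front, ( view_u * shift,  view_v * shift), axis=(0, 1))
    rear  = np.roll(phi_rear,  (-view_u * shift, -view_v * shift), axis=(0, 1))

    # Rotation angles accumulate along the ray; the intensity follows the sin^2 law above.
    total_phi = front + phi_mid + rear
    return I_in * np.sin(total_phi) ** 2
```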

The existing least-squares solver with linear constraints and bounds (LSQLIN) can compute the displayed factorization relatively accurately, but it requires rather high processing complexity. Compared with LSQLIN, the simultaneous algebraic reconstruction technique (SART) [20–22] has been used as a fast solver, since it reaches polarization-based solutions at interactive refresh rates, although its convergence is slightly worse (slightly lower quality but much faster) [14]. Therefore, we adopt SART as the NTF solver in this paper.
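
For reference, the sketch below shows a generic SART sweep for a non-negative linear system $Ax \approx b$, which is the role SART plays once the polarization model is linearized. It follows the standard formulation [20]; the relaxation factor, clipping range, and stopping rule are assumptions and do not reproduce the exact solver configuration of the toolbox [33].

```python
import numpy as np

def sart(A, b, n_iters=50, relax=1.0, x0=None):
    """Simultaneous algebraic reconstruction technique for A @ x ~= b with x in [0, 1].

    A : (M, N) system matrix (rays x layer pixels); b : (M,) target values.
    Each sweep back-projects the row-normalized residual, scales it by the
    column sums, and clips the update to the displayable range.
    """
    M, N = A.shape
    x = np.zeros(N) if x0 is None else x0.astype(float).copy()
    row_sums = A.sum(axis=1) + 1e-12   # per-measurement (ray) normalization
    col_sums = A.sum(axis=0) + 1e-12   # per-unknown (pixel) normalization
    for _ in range(n_iters):
        residual = (b - A @ x) / row_sums
        x += relax * (A.T @ residual) / col_sums
        np.clip(x, 0.0, 1.0, out=x)    # keep layer values displayable
    return x
```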

2.2 Optimization of non-negative tensor factorization (NTF)

Many attempts aim at optimizing NTF with respect to reconstruction quality [14–16], processing time [17], and data acquisition [18,19]. These can be classified into two categories: NTF-based and non-NTF-based methods. Regarding the NTF-based methods, Liu et al. [23] studied the relationship between reconstruction quality and pixels with a lower overload rate. Wang et al. [14] first achieved depth-based optimization of NTF to keep objects within depth bounds located at the corresponding physical layers. Later, they proposed a salience-guided calibration framework that automatically adjusts the reference plane of a scene to include the most salient content [15]. A more precise depth calibration approach was proposed by Zhu et al. [16], which maps dense pixel regions as close to the physical SLM layers as possible based on the depth information of the objects. Cao et al. [18] provided better initial values to expedite the iterative decomposition. In addition, Chen et al. [24] introduced an eye-tracking device to bias the reconstruction of each viewpoint with a viewing-position-dependent weight distribution.

For the non-NTF-based methods, Maruyama et al. [25] optimized the acquisition process by adopting a coded-aperture camera; with more acquired images as input, the reconstruction quality can be improved through a CNN-based algorithm. Takahashi et al. [19] developed an alternative iterative method that replaces sub-aperture images with a few focused images while maintaining output quality. Moreover, a learning-based light field decomposition for multiplicative and additive CLF display models was proposed by Maruyama et al. [17] to achieve faster execution with satisfactory accuracy.

Unfortunately, NTF-based schemes require lengthy execution, while non-NTF-based schemes suffer from quality degradation. None of them has simultaneously outperformed NTF in reconstruction quality and efficiency without additional external devices.

3. Algorithm

Generally, the factorization algorithm decomposes the complete array of light field images into multiple SLM layers so that a 3D scene with a limited field of view (FoV) can be reconstructed through physical viewpoint fusion. In essence, it compresses the dense light field into a few depth components, which is an ill-posed problem that becomes harder as the number of SLM layers decreases [15]. However, industrial and technological constraints make a structure with many display screens impractical. As a result, the reconstruction tends to exhibit blurriness, ghosting, and dislocation among viewpoints, especially at angularly peripheral positions.

In this paper, we develop a dual-guided learning-based factorization that reconstructs high-quality angular components from depth components. Its unified framework is illustrated in Fig. 2. We introduce it in three parts: DAC, feature extraction, and dual-guided convergence.

Fig. 2. Architecture of the proposed network. It can be divided into two learning pipelines: the extraction of calibrated features and of pixel-wise features. Guided by the dual objective functions, the network refines the initial decomposition. Finally, GDBW on the refinement loss biases viewpoint reconstruction so that quality concentrates around the observing position. The symbols are explained in the bottom-right legend.

3.1 Depth-assisted calibration (DAC)

Theoretically, the depth range of the displayed content is bounded by the interval between the device's outer layers and the pixel pitch of the screens [22]. In a three-layer structure with a pixel pitch of 0.6 mm, the upper bound of the displayable depth range cannot exceed twice the interval between adjacent screens. Thus, a CLF display is likely to fail to reproduce the complete depth range of a practical scene, resulting in information loss and poor reconstruction. Previous works use a fixed plane as the depth center of the content for iterative factorization [10,26–28]. The corresponding objective function is

$$\mathop{\arg\min}_{\mathcal{L}_f, \mathcal{L}_m, \mathcal{L}_r} \ {\lvert \lvert \ \mathcal{T} - \mathcal{W} * [\mathcal{L}_f, \mathcal{L}_m, \mathcal{L}_r] \ \rvert \rvert}^{2},$$
where $\mathcal{T}$ is the ground truth of the target scene. The corresponding images for the front, middle, and rear screens are denoted $\mathcal{L}_f$, $\mathcal{L}_m$, and $\mathcal{L}_r$. $\mathcal{W}$ denotes the pre-defined transform matrix [14]; its elements are either 0 or 1 and project the data from the dense spatial representation onto the depth components. However, this approach does not consider the relative position of the SLM layers and the reference plane, which significantly affects reconstruction performance.

To attenuate the loss of information during light field decomposition, Wang et al. [15] optimized the generation of salience maps to calibrate the relative position between objects and SLM layers. Unlike a fixed reference plane, they aim to maximize the depth density within the given display range,

$$\mathop{\arg\min}_{\mathcal{L}_f, \mathcal{L}_m, \mathcal{L}_r} \ {\lvert \lvert \ C(\mathcal{T}) - \mathcal{W} * [\mathcal{L}_f, \mathcal{L}_m, \mathcal{L}_r] \ \rvert \rvert}^{2},$$
where $C(\cdot)$ denotes the salience mapping. Essentially, this dynamically shifts the objects to the depth coordinate containing the most salient information.

Afterwards, Zhu et al. [16] improved the calibration accuracy of the reference plane by minimizing the depth distance from each adjusted pixel to its nearest SLM layer. A better result is obtained when the majority of pixels are located on or close to the SLM layers. Its formula is

$$\mathop{\arg\min}_{\mathcal{L}_f, \mathcal{L}_m, \mathcal{L}_r} \ {\lvert \lvert \ \mathcal{S}(\mathcal{T}) - \mathcal{W} * [\mathcal{L}_f, \mathcal{L}_m, \mathcal{L}_r] \ \rvert \rvert}^{2},$$
while
$$\text{if} \ \mathop{\arg\min}_{\mathcal{T}} \sum_{b}^{B} \mathcal{W}_{b} \cdot E_{b} \cdot \mathcal{K}(D_{\mathcal{T}})_{b} , \ \text{then} \ \mathcal{S}(\mathcal{T}) = \mathcal{T},$$
where $\mathcal{K}(D_{\mathcal{T}})_{b}$ is the pixel density in the $b$-th bin of the histogram of the depth map $D_{\mathcal{T}}$. $\mathcal{W}_{b}$ is a discriminator that identifies which physical SLM layer is closest to the $b$-th bin, and $E_{b}$ is the corresponding Euclidean distance to the nearest SLM layer.

Since the aforementioned methods have proved the validity of depth cues in content-independent settings, we investigate their use in a learning-based architecture. Motivated by [14], we perform DAC on the central sub-aperture image to weight each pixel according to the initialized depth coordinates. The depth cues are then transferred to the other viewpoints by concatenating the weighted and pixel-wise features. This ensures that the content of the target scene is distributed onto the corresponding screens: close-range objects are shown on the front layer, while distant views appear clearly on the rear one. The formula of the proposed framework is

$$\mathop{\arg\min}_{\delta{(\mathcal{I}_{u,v,s,t}; \mathcal{D}_c)}} \ {\lvert \lvert \ \mathcal{T} - \mathcal{W} * \delta{(\mathcal{I}_{u,v,s,t}; \mathcal{D}_c)} \ \rvert \rvert}^{2},$$
where $\delta (\cdot )$ is the learning-based operation. $\mathcal {I}_{u,v,s,t}$ denotes the light field images. By calibrating the complete scene through the central depth map $\mathcal {D}_c$, the displayed layer images can be generated without iteratively changing the location of objects since the displayed screens are fixed.

3.2 Feature extraction

We denote the light field scene as $\mathcal{I} \in \mathbb{R}^{U \times V \times S \times T}$, where $U$ and $V$ are the horizontal and vertical angular resolutions, and $S$ and $T$ are the height and width in the spatial dimensions. First, we calculate the depth-based weighting maps $\mathcal{W}_{c;f}$, $\mathcal{W}_{c;m}$, and $\mathcal{W}_{c;r}$, which restrict the feature maps to the corresponding depth intervals of the central depth map $\mathcal{D}_c$, acquired by real capture or a pre-trained estimator [29],

$$\mathcal{W}_{c;f} = \begin{cases} 1, \ \mathcal{D}_c > \mathcal{B}_{u} \\ 0, \ else \\ \end{cases}$$
$$\mathcal{W}_{c;m} = \begin{cases} 1, \ \mathcal{D}_c \in [\mathcal{B}_{l}, \mathcal{B}_{u}] \\ 0, \ else \\ \end{cases}$$
$$\mathcal{W}_{c;r} = \begin{cases} 1, \ \mathcal{D}_c < \mathcal{B}_{l} \\ 0, \ else, \end{cases}$$
where
$$\begin{cases} \mathcal{B}_{l} = 0.33 \cdot (\mathcal{D}_{c;max} - \mathcal{D}_{c;min}) + \mathcal{D}_{c;min} \\ \mathcal{B}_{u} = 0.33 \cdot (\mathcal{D}_{c;max} - \mathcal{D}_{c;min}) + 2 \cdot \mathcal{D}_{c;min}. \end{cases}$$

Each weighting map multiplies the central view $\mathcal{I}_{c}$ to isolate the content in the depth interval assigned to each SLM layer. Afterwards, the calibrated views are concatenated and fed into the depth convolutional layer $\mathcal{H}_{depth}$ with trainable parameters $\phi_{depth}$,

$$\mathcal{F}_{depth} = \mathcal{H}_{depth} (\mathcal{\sigma}[\mathcal{I}_{c} \cdot \mathcal{W}_{c;f}, \mathcal{I}_{c} \cdot \mathcal{W}_{c;m}, \mathcal{I}_{c} \cdot \mathcal{W}_{c;r}]|_{\phi_{depth}}),$$
where $\mathcal{\sigma}$ denotes concatenation and $\mathcal{F}_{depth}$ contains the pixel features calibrated by the initial depth map. We introduce DAC to ensure coherence between the actual depth distribution and the hierarchical display structure; it injects depth information without disrupting the pixel consistency of the shallow features.
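
As a concrete illustration of this calibration step, the sketch below builds the three binary maps from a central depth map using the bound expressions given above and applies them to the central view. Deriving the bounds from the per-scene depth minimum and maximum is our reading of the formula; function and variable names are illustrative.

```python
import numpy as np

def depth_weighting_maps(depth_c):
    """Binary masks assigning each central-view pixel to the front/middle/rear layer."""
    d_min, d_max = float(depth_c.min()), float(depth_c.max())
    b_l = 0.33 * (d_max - d_min) + d_min        # lower bound, as defined above
    b_u = 0.33 * (d_max - d_min) + 2.0 * d_min  # upper bound, as defined above

    w_front  = (depth_c > b_u).astype(np.float32)
    w_middle = ((depth_c >= b_l) & (depth_c <= b_u)).astype(np.float32)
    w_rear   = (depth_c < b_l).astype(np.float32)
    return w_front, w_middle, w_rear

def calibrate_central_view(central_view, depth_c):
    """Stack the three depth-calibrated copies of the central view for the depth branch."""
    masks = depth_weighting_maps(depth_c)
    return np.stack([central_view * w for w in masks], axis=0)  # shape (3, S, T)
```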

In the other pipeline of the network (extraction of pixel-wise features), the entire sub-aperture array $\mathcal{I}$ is used as the input to extract the initial features $\mathcal{F}_{init}$ with the convolutional layer $\mathcal{H}_{init}$,

$$\mathcal{F}_{init} = \mathcal{H}_{init} (\mathcal{I} |_{\phi_{init}}),$$
where $\phi_{init}$ denotes the trainable parameters of $\mathcal{H}_{init}$.

Note that we vectorize $\mathcal{I}$ in the angular domain to keep the input size consistent with the convolutional features at a lower computational cost. Residual convolutional blocks are then deployed to learn deep residual features,

$$\mathcal{F}_{res,i} = \begin{cases} \mathcal{H}_{res,i} (\mathcal{F}_{res,i-1}|_{\phi_{res,i}}), \ i > 0 \\ \mathcal{H}_{res,i} (\mathcal{F}_{init}|_{\phi_{res,i}}), \ i = 0, \end{cases}$$
where $\mathcal{H}_{res,i}$ represents the $i$-th residual block, and $\mathcal{F}_{res,i}$ and $\phi_{res,i}$ are the corresponding features and trainable parameters of the $i$-th block.

Finally, we concatenate $\mathcal{F}_{depth}$ and $\mathcal{F}_{res,i}$ and feed them into the convolutional layer $\mathcal{H}_{tail}$ with trainable parameters $\phi_{tail}$,

$$[\mathcal{\hat{L}}_f, \mathcal{\hat{L}}_m, \mathcal{\hat{L}}_r] = \mathcal{H}_{tail} (\mathcal{\sigma}[\mathcal{F}_{res,i}, \mathcal{F}_{depth}]|_{\phi_{tail}}),$$
where $\mathcal{\hat{L}}_f$, $\mathcal{\hat{L}}_m$, and $\mathcal{\hat{L}}_r$ are the reconstructed front, middle, and rear displayed layer images, respectively. They are used to recover the sub-aperture views $\mathcal{\hat{I}}$ through polarization-based viewpoint fusion [12]. In practice, multiple viewpoints can be observed through the CLF display after deploying these images on the corresponding SLM layers.
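
To make the two-branch structure concrete, a PyTorch sketch of the forward pass is given below. The channel counts, number of residual blocks, single-channel (grayscale) views, and the sigmoid output range are assumptions for illustration; the actual hyper-parameters are those listed in Table 1.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class DualGuidedFactorization(nn.Module):
    """Sketch of the two-branch factorization network (sizes are illustrative)."""
    def __init__(self, n_views=49, feat=64, n_res=4):
        super().__init__()
        self.h_depth = nn.Conv2d(3, feat, 3, padding=1)        # depth branch: 3 calibrated central views
        self.h_init  = nn.Conv2d(n_views, feat, 3, padding=1)  # pixel branch: angularly vectorized light field
        self.res     = nn.Sequential(*[ResBlock(feat) for _ in range(n_res)])
        self.h_tail  = nn.Conv2d(2 * feat, 3, 3, padding=1)    # front/middle/rear layer images

    def forward(self, lf_views, central_view, w_f, w_m, w_r):
        # lf_views: (B, U*V, S, T) sub-aperture array; central_view: (B, 1, S, T); w_*: depth masks.
        calibrated = torch.cat([central_view * w_f, central_view * w_m, central_view * w_r], dim=1)
        f_depth = self.h_depth(calibrated)
        f_pixel = self.res(self.h_init(lf_views))
        layers = self.h_tail(torch.cat([f_pixel, f_depth], dim=1))
        return torch.sigmoid(layers)  # keep layer values in a displayable (0, 1) range
```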

3.3 Objective function

Dual-guided objective function. The prior learning-based method [17] is converged only by the distortion of the viewing images. This quality-driven approach is not sufficient, since immersive performance depends not only on how well the scene is recovered but also on how faithfully the practical depth distribution is followed; otherwise, viewers might observe close-range objects on the rear layer, which would surely degrade the viewing experience. Therefore, instead of training from scratch, we first learn from the layer images decomposed by NTF, measuring their distortion as the visualization loss $\ell_{v}$,

$$\ell_{v} = \sum_i^I \sum_s^S \sum_t^T \Vert \mathcal{L}_{i}(s,t) - \mathcal{\hat{L}}_{i}(s,t) \Vert ^2, \ I \in [f, m, r].$$

Furthermore, we refine the reconstruction quality while maintaining the fundamental integrity provided by the NTF layer images $\mathcal{L}$, using the quality-driven term as a pixel-wise adjustment. It measures the distortion of the sub-aperture images $\mathcal{I}$ as the refinement loss $\ell_{r}$,

$$\ell_{r} = \sum_u^U \sum_v^V \sum_s^S \sum_t^T \Vert \mathcal{I}_{u,v}(s,t) - \mathcal{\hat{I}}_{u,v}(s,t) \Vert ^2.$$

The weights $\lambda_v$ = 0.99 and $\lambda_r$ = 0.09 yield the best reconstruction performance; the weighted losses are summed into the overall objective function,

$$\ell_{total} = \lambda_v \cdot \ell_{v} + \lambda_r \cdot \ell_{r}.$$
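
A minimal sketch of this dual-guided objective is shown below, assuming grayscale layer and view tensors and a differentiable polarization-based fusion that produces the predicted views; tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def dual_guided_loss(pred_layers, ntf_layers, pred_views, gt_views,
                     lambda_v=0.99, lambda_r=0.09):
    """Weighted sum of the visualization and refinement losses.

    pred_layers / ntf_layers : (B, 3, S, T) predicted and NTF-decomposed layer images.
    pred_views  / gt_views   : (B, U*V, S, T) fused and ground-truth sub-aperture views.
    """
    loss_v = F.mse_loss(pred_layers, ntf_layers, reduction='sum')  # layer-image guidance
    loss_r = F.mse_loss(pred_views, gt_views, reduction='sum')     # viewpoint refinement
    return lambda_v * loss_v + lambda_r * loss_r
```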

Gauss-distribution-based weighting (GDBW). Viewers cannot observe the complete array of viewpoints simultaneously because of the directional property of the light beams. Therefore, it is unnecessary to restore the entire array evenly for a single observer, since the edge viewpoints can barely be seen [24]. A better approach is to concentrate on high-quality recovery around the observing center, which must be relocated as the observer moves. However, the conventional even weighting can only maximize the overall quality improvement. Inspired by the unbalanced distribution used in the iterative method [24], we apply GDBW to our refinement objective function. It optimizes the reconstruction unevenly depending on the viewing position, and it keeps the quality variation among viewpoints smooth, avoiding the negative perceptual impact of visual inconsistency. Its formula is

$$\lambda_{u,v;u_0,v_0} = \frac{1}{2\pi} \exp({-}0.5 \cdot (u-u_0)^2) \cdot \exp({-}0.5 \cdot (v-v_0)^2),$$
where $\{u_0,v_0\}$ is the angular index of the observing central view and $\{u,v\}$ is the index of a particular angular view. The weights thus depend on the angular distance between the target view and the observing center. In this case, the total loss is
$$\ell_{total,Gauss} = \ell_{r,Gauss} + \lambda_v \cdot \ell_{v} + \lambda_r \cdot \ell_{r},$$
where
$$\ell_{r,Gauss} = \sum_{u}^{U} \sum_{v}^{V} \sum_{s}^{S} \sum_{t}^{T} \lambda_{u,v;u_0,v_0} \Vert \mathcal{I}_{u,v}(s,t) - \mathcal{\hat{I}}_{u,v}(s,t) \Vert ^2.$$

Note that the model must be retrained for each observing position. Retraining is an off-line process and adds no computational cost during practical demonstration.
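
The weighting and the weighted refinement loss can be sketched as follows, assuming the angular indices are centred at zero (so the central view is $\{u=0, v=0\}$) and the views are stacked along the channel dimension; names and shapes are illustrative.

```python
import math
import torch

def gdbw_weights(U, V, u0, v0):
    """Per-view Gaussian weights centred on the observing position (u0, v0)."""
    u = torch.arange(U).float().view(U, 1) - (U - 1) / 2  # angular indices centred at 0
    v = torch.arange(V).float().view(1, V) - (V - 1) / 2
    return torch.exp(-0.5 * (u - u0) ** 2) * torch.exp(-0.5 * (v - v0) ** 2) / (2 * math.pi)

def gdbw_refinement_loss(pred_views, gt_views, u0=0, v0=0):
    """Refinement loss with view-dependent Gaussian weighting (views stacked as U*V channels)."""
    B, UV, S, T = pred_views.shape
    U = V = int(UV ** 0.5)
    w = gdbw_weights(U, V, u0, v0).reshape(1, UV, 1, 1).to(pred_views.device)
    return (w * (pred_views - gt_views) ** 2).sum()
```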

4. Experiments

We collect synthetic and real-world light field scenes from public datasets [30–32] to ensure data variety with respect to parallax disparity and scene content. The samples are unified to 7$\times$7 angular dimensions and normalized to the range (0, 1). For training, we perform data augmentation by randomly cropping 96$\times$96 patches. The open-source light field synthesis tool [33] is used as the NTF implementation and serves as the baseline; we set SART as the solver and use 50 iterative steps to reach the best possible reconstruction performance. To verify the efficiency of the proposed method, we also compare against the CNN-based factorization [17].
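
A minimal sketch of this patch preparation is given below; the min–max normalization and the array layout are assumptions, since only the crop size and the target range are stated.

```python
import numpy as np

def random_patch(lf, patch=96):
    """Randomly crop a spatial patch from a 7x7 light field and normalize it to (0, 1).

    lf : (U, V, S, T) array of sub-aperture views.
    """
    U, V, S, T = lf.shape[:4]
    s = np.random.randint(0, S - patch + 1)
    t = np.random.randint(0, T - patch + 1)
    crop = lf[:, :, s:s + patch, t:t + patch].astype(np.float32)
    return (crop - crop.min()) / (crop.max() - crop.min() + 1e-8)
```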

Note that some other competitive methods cannot be reproduced because they require particular acquisition devices [24,25] or rely on iterative procedures to find the optimal position [15,16].

PSNR (peak signal-to-noise ratio) and LPIPS [34] are adopted as evaluation metrics. The hyper-parameters are given in Table 1. Training the network with 1,000 patches takes approximately 2 hours on an Nvidia RTX 3090 GPU.
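
For reference, the metrics can be computed as sketched below; the AlexNet backbone for LPIPS and the unit peak value are assumptions, as they are not specified here.

```python
import numpy as np
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')  # perceptual metric [34]

def psnr(ref, rec, peak=1.0):
    """PSNR between two images in the same value range."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def lpips_score(ref, rec):
    """ref, rec: (H, W, 3) arrays in [0, 1]; LPIPS expects NCHW tensors in [-1, 1]."""
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    with torch.no_grad():
        return loss_fn(to_t(ref), to_t(rec)).item()
```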

Table 1. Configurations of hyper-parameters for training.

4.1 Evaluation w.r.t. viewpoint reconstruction

Table 2 quantitatively reports the viewpoint reconstruction performance on 10 synthetic and real-world scenes with different spatial resolutions. The proposed method clearly improves on the baseline, gaining approximately 1.7 dB in PSNR and dropping 0.10 in LPIPS on average. It also outperforms the CNN-based method, with a rise of 2.56 dB in PSNR and a decrease of 0.11 in LPIPS on average. Furthermore, the proposed method improves both the maximal (central) and minimal (edge) viewpoints compared with the other methods, indicating that it optimizes viewpoint reconstruction comprehensively.

Table 2. Quantitative comparisons in the performance of viewpoint reconstruction by the composition of displayed layer images. Red denotes the best performance, while blue is the second-best performance in average, minimum (edging) and maximum (center) aspects, respectively.

By deploying GDBW with the observing center $\{u_0=0,v_0=0\}$, the proposed method obtains a significant perceptual improvement, with LPIPS declining by 0.03 on average, 0.01 on the edge viewpoints, and 0.11 on the central viewpoint. It also achieves the second-best objective performance among the compared methods.

Subjectively, we visualize zoom-in patches of the results reconstructed by each method in Fig. 3. Common artifacts appear at various angular coordinates, such as blurriness in Fig. 3(d,e,n,o), malposition in Fig. 3(i,j), and ghosting in Fig. 3(s,t). The proposed method alleviates these subjective issues, providing a clearer and more stable view.

Fig. 3. Visualization of viewpoint reconstruction at arbitrary angular coordinates. (a, f, k, p) Ground truth of complete view; (b, g, l, q) ground truth of zoom-in patches; (c, h, m, r) reconstruction patches by the proposed method; (d, i, n, s) reconstruction patches by the CNN-based method [17]; (e, j, o, t) reconstruction patches by the baseline [33].

4.2 Polarization-based displayed layers

In general, by deploying each displayed image on the corresponding SLM layer, different viewpoints can be observed from various viewing locations in practice. The polarization-based viewpoint fusion is used as the algorithm-level composition in each method, so the improvement in viewpoint reconstruction mainly originates from the optimization of the displayed images. Therefore, we show the displayed images in Fig. 4, cropping a patch from each layer and zooming in for a better visual comparison. Since there is no ground truth for displayed images, we evaluate them by whether content at a given depth is displayed with high pixel density on the correct SLM layer [15,16]. Compared with the baseline, the proposed method produces clearer and more accurate zoom-in patches, especially on the front and rear SLM layers, indicating that it better reproduces the practical depth distribution.

Fig. 4. Visualization of displayed images for each SLM layer and the corresponding zoom-in areas for a clear comparison.

4.3 Evaluation w.r.t. computational efficiency

In addition to reconstruction quality, we also compare the time consumption, reported in Table 3, which is a critical factor in the practicability of display factorization. Note that we record the processing time and PSNR of the baseline at various iteration counts between 5 and 50. The inference time of the proposed method is much lower than that of the baseline. Moreover, the proposed method achieves better quality than the CNN-based method at a similar inference time. Although the CNN-based method decomposes faster than the baseline, it generally cannot match its quality. This illustrates the superiority of the proposed method, which accomplishes real-time, high-quality decomposition of light field scenes.

Table 3. Comparisons of computational efficiency in $Blue\_room$ scene [32].

4.4 Ablation study

In this section, we conduct an ablation study to verify the effectiveness of DAC and GDBW. Figure 5 shows the reconstruction performance of each individual view for the proposed method in three configurations, "w/o DAC", "with DAC", and "with DAC and GDBW", compared with the baseline. A clear quality improvement is observed for "with DAC" over "w/o DAC". Furthermore, "with DAC and GDBW" delivers outstanding quality at the observing center $\{u_0=0,v_0=0\}$ and its neighboring viewpoints compared with the other configurations and the baseline.

Fig. 5. Heatmaps of objective performance measured by PSNR for each viewpoint of the $Grids$ (first row), $Coffee\_time$ (second row), and $Bottles$ (third row) scenes from the Inria dataset [32].

We further verify the effectiveness of GDBW by retraining the model for different observing centers. Figure 6 visualizes zoom-in patches from the central view $\{u=0,v=0\}$ with and without GDBW, when observing at $\{u_0=0,v_0=0\}$ and $\{u_0=-2,v_0=-2\}$. Compared with even weighting, GDBW shows a clear subjective advantage; e.g., a sharp "cup handle" is visible for both observing positions.

Fig. 6. Visualized reconstruction of the $\{u=0, v=0\}$ view by the proposed method with and without GDBW at two different observing positions: $\{u_0=0,v_0=0\}$ (red) and $\{u_0=-2,v_0=-2\}$ (blue).

4.5 Prototype

To demonstrate the practicality of our optimization, we assemble a 15.4-inch prototype, as shown in Fig. 7. The prototype uses a three-layer SLM structure driven by an external uniform backlight with industrial-grade power, which keeps the display observable in daylight. Figure 8 shows the displayed results of light field scenes from the public dataset [32], factorized by the baseline and by our method with GDBW ($\{u_0=0,v_0=0\}$). In the practical demonstration, the proposed method shows an outstanding subjective improvement over the baseline.

Fig. 7. Hardware implementation of the CLF display.

Fig. 8. Real images captured from the bottom-left (first column) and central (second column) viewing angles of our prototype, factorized by the baseline [33] and by our method with GDBW.

5. Conclusion

In this paper, we present an accurate and fast learning-based factorization for a three-layer polarization-based architecture, which is, to our knowledge, the first to demonstrate the potential for commercial application of glasses-free CLF displays. The method leverages a dual-guided training process to achieve pixel-wise refinement. DAC then arranges the depth range of the displayed images to follow the actual depth distribution, simulating a 3D scene within a limited field of view. Finally, GDBW is incorporated into the objective function to reconstruct the viewpoint array unevenly, favoring views at or near the observing position. Overall, our method significantly increases quantitative reconstruction performance and mitigates reconstruction artifacts. More importantly, the optimized results can be viewed clearly on our assembled prototype, surpassing the previous state of the art.

Funding

National Science Foundation (1747751, 2148382).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in [30–32].

References

1. D. M. Hoffman, A. R. Girshick, K. Akeley, and M. S. Banks, “Vergence–accommodation conflicts hinder visual performance and cause visual fatigue,” J. vision 8(3), 33 (2008). [CrossRef]  

2. C. Yan, X. Liu, D. Liu, J. Xie, X. X. Xia, and H. Li, “Omnidirectional multiview three-dimensional display based on direction-selective light-emitting diode array,” Opt. Eng. 50(3), 034003 (2011). [CrossRef]  

3. Z. Wang, G. Lv, Q. Feng, A. Wang, and H. Ming, “Simple and fast calculation algorithm for computer-generated hologram based on integral imaging using look-up table,” Opt. Express 26(10), 13322–13330 (2018). [CrossRef]  

4. Z. Wang, G. Lv, Q. Feng, A. Wang, and H. Ming, “Resolution priority holographic stereogram based on integral imaging with enhanced depth range,” Opt. Express 27(3), 2689–2702 (2019). [CrossRef]  

5. D. Wang, C. Liu, C. Shen, Y. Xing, and Q.-H. Wang, “Holographic capture and projection system of real object based on tunable zoom lens,” PhotoniX 1(1), 6–15 (2020). [CrossRef]  

6. W.-X. Zhao, Q.-H. Wang, A.-H. Wang, and D.-H. Li, “Autostereoscopic display based on two-layer lenticular lenses,” Opt. Lett. 35(24), 4127–4129 (2010). [CrossRef]  

7. Q.-H. Wang, C.-C. Ji, L. Li, and H. Deng, “Dual-view integral imaging 3d display by using orthogonal polarizer array and polarization switcher,” Opt. Express 24(1), 9–16 (2016). [CrossRef]  

8. H.-L. Zhang, H. Deng, J.-J. Li, M.-Y. He, D.-H. Li, and Q.-H. Wang, “Integral imaging-based 2d/3d convertible display system by using holographic optical element and polymer dispersed liquid crystal,” Opt. Lett. 44(2), 387–390 (2019). [CrossRef]  

9. G. Wetzstein, D. Lanman, M. Hirsch, W. Heidrich, and R. Raskar, “Compressive light field displays,” IEEE Comput. Grap. Appl. 32(5), 6–11 (2012). [CrossRef]  

10. G. Wetzstein, D. R. Lanman, M. W. Hirsch, and R. Raskar, “Tensor displays: compressive light field synthesis using multilayer displays with directional backlighting,” (2012).

11. G. Wetzstein, D. Lanman, W. Heidrich, and R. Raskar, “Layered 3d: tomographic image synthesis for attenuation-based light field and high dynamic range displays,” in ACM SIGGRAPH 2011 papers (2011), pp. 1–12.

12. D. Lanman, G. Wetzstein, M. Hirsch, W. Heidrich, and R. Raskar, “Polarization fields: dynamic light field display using multi-layer lcds,” in Proceedings of the 2011 SIGGRAPH Asia Conference (2011), pp. 1–10.

13. D. Lanman, M. Hirsch, Y. Kim, and R. Raskar, “Content-adaptive parallax barriers: optimizing dual-layer 3d displays using low-rank light field factorization,” in ACM SIGGRAPH Asia 2010 papers (2010), pp. 1–10.

14. S. Wang, Z. Zhuang, P. Surman, J. Yuan, Y. Zheng, and X. W. Sun, “Two-layer optimized light field display using depth initialization,” in 2015 Visual Communications and Image Processing (VCIP) (IEEE, 2015), pp. 1–4.

15. S. Wang, W. Liao, P. Surman, Z. Tu, Y. Zheng, and J. Yuan, “Salience guided depth calibration for perceptually optimized compressive light field 3d display,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 2031–2040.

16. L. Zhu, G. Lv, L. Xv, Z. Wang, and Q. Feng, “Performance improvement for compressive light field display based on the depth distribution feature,” Opt. Express 29(14), 22403–22416 (2021). [CrossRef]  

17. K. Maruyama, K. Takahashi, and T. Fujii, “Comparison of layer operations and optimization methods for light field display,” IEEE Access 8, 38767–38775 (2020). [CrossRef]  

18. X. Cao, Z. Geng, T. Li, M. Zhang, and Z. Zhang, “Accelerating decomposition of light field video for compressive multi-layer display,” Opt. Express 23(26), 34007–34022 (2015). [CrossRef]  

19. K. Takahashi, Y. Kobayashi, and T. Fujii, “From focal stack to tensor light-field display,” IEEE Trans. on Image Process. 27(9), 4571–4584 (2018). [CrossRef]  

20. A. H. Andersen and A. C. Kak, “Simultaneous algebraic reconstruction technique (sart): a superior implementation of the art algorithm,” Ultrason. Imaging 6(1), 81–94 (1984). [CrossRef]  

21. A. C. Kak and M. Slaney, Principles of computerized tomographic imaging (SIAM, 2001).

22. D. Lanman, G. Wetzstein, M. Hirsch, and R. Raskar, “Depth of field analysis for multilayer automultiscopic displays,” in Journal of Physics: Conference Series, vol. 415 (IOP Publishing, 2013), vol. 415, p. 012036.

23. M. Liu, C. Lu, H. Li, and X. Liu, “Bifocal computational near eye light field displays and structure parameters determination scheme for bifocal computational display,” Opt. Express 26(4), 4060–4074 (2018). [CrossRef]  

24. D. Chen, X. Sang, X. Yu, X. Zeng, S. Xie, and N. Guo, “Performance improvement of compressive light field display with the viewing-position-dependent weight distribution,” Opt. Express 24(26), 29781–29793 (2016). [CrossRef]  

25. K. Maruyama, Y. Inagaki, K. Takahashi, T. Fujii, and H. Nagahara, “A 3-d display pipeline from coded-aperture camera to tensor light-field display through cnn,” in 2019 IEEE International Conference on Image Processing (ICIP) (IEEE, 2019), pp. 1064–1068.

26. H. Gotoda, “Implementation and analysis of an autostereoscopic display using multiple liquid crystal layers,” in Stereoscopic Displays and Applications XXIII, vol. 8288 (SPIE, 2012), vol. 8288, pp. 71–77.

27. F.-C. Huang, D. P. Luebke, and G. Wetzstein, “The light field stereoscope.” in SIGGRAPH Emerging Technologies (2015), p. 24.

28. W. Liao, S. Wang, M. Sun, P. Surman, Y. Zheng, J. Yuan, and X. W. Sun, “19-5l: Late-news paper: Perceptually optimized dual-layer light field 3d display using a moiré-aware compressive factorization,” in SID Symposium Digest of Technical Papers, vol. 47 (Wiley Online Library, 2016), vol. 47, pp. 235–238.

29. Y.-J. Tsai, Y.-L. Liu, M. Ouhyoung, and Y.-Y. Chuang, “Attention-based view selection networks for light-field disparity estimation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34 (2020), vol. 34, pp. 12095–12103.

30. K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, “A dataset and evaluation methodology for depth estimation on 4d light fields,” in Asian Conference on Computer Vision (Springer, 2016), pp. 19–34.

31. M. Rerabek and T. Ebrahimi, “New light field image dataset,” in 8th International Conference on Quality of Multimedia Experience (QoMEX) (2016), CONF.

32. J. Shi, X. Jiang, and C. Guillemot, “A framework for learning depth from a flexible subset of dense and sparse light field views,” IEEE Trans. on Image Process. 28(12), 5867–5880 (2019). [CrossRef]  

33. D. G. Dansereau, O. Pizarro, and S. B. Williams, “Decoding, calibration and rectification for lenselet-based plenoptic cameras,” in Proceedings of the IEEE conference on computer vision and pattern recognition (2013), pp. 1027–1034.

34. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition (2018), pp. 586–595.
