
Real-time dense-view imaging for three-dimensional light-field display based on image color calibration and self-supervised view synthesis

Open Access

Abstract

Three-dimensional (3D) light-field display has achieved promising improvement in recent years. However, because dense-view images cannot be collected quickly in real-world 3D scenes, real-time 3D light-field display of real scenes remains challenging, especially at high display resolutions. Here, a real-time dense-view 3D light-field display method is proposed based on image color correction and self-supervised optical flow estimation, and a high-quality, high-frame-rate 3D light-field display is realized. In the proposed method, a sparse camera array first captures sparse-view images. To eliminate the color deviation among the sparse views, the imaging process of the camera is analyzed, and a practical multi-layer perceptron (MLP) network is proposed to perform color calibration. Given sparse views with consistent color, the optical flow is estimated at high speed by a lightweight convolutional neural network (CNN), which learns from the input image pairs in a self-supervised manner. Dense-view images are then synthesized with an inverse warp operation. Quantitative and qualitative experiments are performed to evaluate the feasibility of the proposed method. Experimental results show that over 60 dense-view images at a resolution of 1024 × 512 can be generated from 11 input views at a frame rate of over 20 fps, which is 4× faster than the previous optical flow estimation methods PWC-Net and LiteFlowNet3. Finally, a large-viewing-angle, high-quality 3D light-field display at 3840 × 2160 resolution is achieved in real time.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Recently, the three-dimensional (3D) light-field display has attracted significant attention. With specifically designed optical equipment to modulate light directions, the 3D light-field display can present full-color 3D images with large viewing angles and dense views [1,2]. However, collecting dense-view images from the real world, especially in real time, remains an obstacle for 3D light-field display.

To date, several approaches have been proposed to obtain multi-view images. Computer-generated imaging is a popular scheme for generating multi-view images [3–11], and some methods have achieved real-time performance for high-resolution 3D light-field display, such as backward ray-tracing (BRT) [5,6], directional path tracing (DPT) [7,8], multiple-reference-view depth-image-based rendering (MDIBR) [9], parallel multi-view polygon rasterization (PMR) [10], and modified single-pass multiview rendering [11]. However, these methods are designed for virtual 3D scenes rather than real-world 3D scenes, so they cannot generate multi-view images of real scenes. On the other hand, light-field cameras and dense-view camera arrays have been proposed in recent years to capture multi-view images directly, such as Lytro Illum, Raytrix, and the Stanford large camera arrays [12,13]. However, due to their short baselines and complex hardware systems, these schemes are not suitable for real-time capture of 3D light-field images.

Recent advances in deep learning have facilitated the investigation of view synthesis with convolutional neural networks (CNNs). The major advantage of a CNN is its effectiveness in extracting features and reconstructing images, so it can be used to synthesize novel views with high quality. Numerous studies have attempted to generate realistic novel views with CNNs. To estimate the correspondence of sparse-view images effectively and efficiently, optical flow-based methods such as LiteFlowNet3 [14] and PWC-Net [15] were proposed. With accurate flow estimation, the novel view can be correctly synthesized. However, because these algorithms generally require optical flow labels for supervised learning, they are difficult to apply in real scenes, where no labels are available. In contrast, some self-supervised methods have been proposed to generate novel views without ground truth, such as multi-plane images (MPI) [16], the multi-parallax view net (MPVN) [17], dense-view synthesis [18], virtual view synthesis [19], and neural radiance fields [20]. In practice, however, the heavy computation of these methods prevents them from meeting real-time requirements, so they are not suitable for real-time dense-view synthesis.

In addition to fast dense-view synthesis, inter-camera color consistency is another challenging problem in practical multi-view applications. Because the color responses of different camera sensors differ, the captured images generally cannot guarantee color consistency, which degrades the performance of many applications. To address this problem, most methods regard one camera as the reference and the others as test cameras, and then estimate a function that maps the colors of the test cameras to the reference camera [21]. Traditional color calibration algorithms can be divided into statistic-based and geometry-based methods. Statistic-based approaches use statistical analysis to achieve color calibration, such as the mean and variance of images [22] or histogram matching [23,24]. However, these methods perform an image-dependent mapping, which is time-consuming in practical applications.

In contrast, geometry-based methods use corresponding features among different cameras to achieve a scene-dependent mapping. In practice, the corresponding features can be provided by a designed color checker [25] or a standard color checker [26], from which the color correction parameters or a color correction matrix (CCM) are derived; these can be applied not only to the current images but also to other images under the same illumination. Although geometry-based methods with a pre-estimated CCM are feasible for our real-time application, the image quality of the color calibration is unsatisfactory because only sparse corresponding features are provided, and it still requires improvement.

Here, a real-time dense-view imaging method for 3D light-field display is proposed based on image color calibration and self-supervised view synthesis. It synthesizes high-quality dense-view images from sparse-view images while meeting real-time speed requirements. In the proposed method, sparse-view images are first captured by a sparse camera array. Then, a color calibration model based on a multi-layer perceptron (MLP) network is introduced to map the colors of the source images to the reference image. After a homography transformation aligns the sparse-view images horizontally, a lightweight CNN estimates the optical flow by self-supervised learning. Finally, dense-view images are synthesized using an inverse warp operation with the estimated optical flow. Experimental results show that the proposed method realizes a real-time 3D light-field display at 3840 $\times$ 2160 resolution, and qualitative results demonstrate that a high-quality dense-view 3D light-field display can be achieved.

2. Method

The schematic diagram of the proposed real-time dense-view imaging method is illustrated in Fig. 1. It involves four steps: (1) data capture, where the multi-camera array collects the sparse views; (2) image preprocessing, where the initial images with inconsistent color are corrected by a learning-based color calibration model, and a homography transformation is employed to align the images horizontally; (3) dense-view synthesis, where the optical flow of two neighboring views is estimated by the proposed lightweight CNN, and dense-view images are then generated by backward warping with the optical flow; and (4) 3D light-field display, where the dense-view images are presented on the innovative 3D light-field display in real time.

Fig. 1. Overview of our proposed method.

2.1 Image capture and formation analysis

With a designed multi-camera array, images from different perspectives can be captured simultaneously. However, since the configurations of the sensors differ, the colors of the captured images may not be identical. Thus, it is essential to calibrate the color of these images before virtual view synthesis. To better understand the color calibration, the image formation process is analyzed first, as shown in Fig. 2. Theoretically, the response of a camera sensor is calculated as [27–30]

$$I_{raw} = \int_{\omega} S(\lambda)E(\lambda)R(\lambda)\,d\lambda,$$
where $I_{raw}$ is the single-channel RAW response of the camera, $\lambda$ represents the wavelength of the incident light, $\omega$ denotes the visible spectrum, $S(\lambda )$ is the camera sensitivity function, and $E(\lambda )$ and $R(\lambda )$ represent the spectral power distribution and surface reflectance, respectively. After that, the demosaic operation is applied to $I_{raw}$, and an image with three channels is produced as
$$I_{lin} = f_{demos}(I_{raw}),$$
where $I_{lin}$ is the demosaiced image, and $f_{demos}$ is the demosaic operation. In the end, the visible RGB image is generated by the remaining image signal processing (ISP) steps as
$$I_{rgb} = f_{ISP}(I_{lin}),$$
where $I_{rgb}$ is the RGB image, and $f_{ISP}$ involves the ISP operations, such as color adjustment, tone mapping, and compression.
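As a rough illustration of Eqs. (1)–(3), the following Python sketch approximates the single-channel raw response as a discrete spectral integral and applies a placeholder ISP step; the sensitivity, illuminant, and reflectance curves and the gamma value are illustrative assumptions, not measured data, and the demosaic step of Eq. (2) is omitted because only one channel is simulated.

```python
import numpy as np

# Illustrative spectra sampled every 10 nm over 400-700 nm (assumed shapes, not measured data).
wavelengths = np.arange(400, 701, 10, dtype=float)
S = np.exp(-((wavelengths - 550.0) / 60.0) ** 2)   # assumed camera sensitivity S(lambda)
E = np.ones_like(wavelengths)                      # assumed flat illuminant E(lambda)
R = 0.5 * np.ones_like(wavelengths)                # assumed surface reflectance R(lambda)

# Eq. (1): discrete approximation of the spectral integral for one sensor channel.
I_raw = np.sum(S * E * R) * 10.0                   # d(lambda) = 10 nm

def simple_isp(I_lin, gamma=2.2):
    """Eq. (3): stand-in for the non-linear ISP stage (here only gamma tone mapping)."""
    return np.clip(I_lin, 0.0, 1.0) ** (1.0 / gamma)

white = np.sum(S * E) * 10.0                       # response to a perfect white reflector
I_rgb_like = simple_isp(I_raw / white)             # normalized, tone-mapped value (~0.73)
```

The point of the sketch is only that the final pixel value is a non-linear function of a quantity proportional to the scene radiance, which is what the color calibration model in Section 2.2 exploits.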

Fig. 2. RGB image formation process.

2.2 Color correction model

Here, the color relation between the reference image and test image is discussed, and a learning-based color calibration model is presented to estimate the CCM efficiently.

Given a Lambertian reflectance model, the $E(\lambda )$ and $R(\lambda )$ of the reference and test images are identical. Suppose the coordinates of a pair of matching points in the reference and test images are ($x_1, y_1$) and ($x_2, y_2$), respectively. Based on the proportional relation between the raw data and the scene radiance [31,32], the raw data of the reference and test images can be written as

$$\begin{cases} I_{raw}^{r}(x_1, y_1) = \alpha^{r} L^{r} \\ I_{raw}^{t}(x_2, y_2) = \alpha^{t} L^{t} \end{cases},$$
where $L^{r}=L^{t}=L$ denotes the scene radiance with the same $E(\lambda )$ and $R(\lambda )$, and $\alpha ^{r}$ and $\alpha ^{t}$ represent the proportional factors of the reference and test images, respectively. Since $L^{r}$ and $L^{t}$ are equal, it follows from Eq. (4) that $I_{raw}^{r}(x_1, y_1)$ and $I_{raw}^{t}(x_2, y_2)$ are also proportional.

Then, the demosaic and ISP operators are applied to the raw data to produce the final RGB image. The demosaic step is shown in Fig. 3(a), where the Bayer-pattern raw data are used to generate a three-channel image while preserving the linear measurement of the raw data [32]. Thus, the demosaiced data can be represented as

$$\begin{cases} I_{lin}^{r}(x_1, y_1) = \beta^{r}(I_{raw}^{r}(x_1, y_1)) = \gamma^{r}L^{r} \\ I_{lin}^{t}(x_2, y_2) = \beta^{t}(I_{raw}^{t}(x_2, y_2)) = \gamma^{t}L^{t} \end{cases},$$
where $\beta ^{r}$ and $\beta ^{t}$ are the linear factors of the reference and test demosaiced data, respectively, and $\gamma ^{r}=\beta ^{r}\alpha ^{r}$ and $\gamma ^{t}=\beta ^{t}\alpha ^{t}$ represent the proportional factors between the demosaiced data and the scene radiance of the reference and test images, respectively. Finally, the ISP step, which involves non-linear operators, is applied to the demosaiced data, and the RGB image is generated as
$$\begin{cases} I_{rgb}^{r}(x_1, y_1) = f_{ISP}^{r}\left(\gamma^{r}L^{r}\right) \\ I_{rgb}^{t}(x_2, y_2) = f_{ISP}^{t}\left(\gamma^{t}L^{t}\right) \end{cases},$$
where $f_{ISP}^{r}$ and $f_{ISP}^{t}$ are the ISP operations of reference and test cameras, respectively. Based on Eqs. (4)–(6), the relation between $I_{rgb}^{r}(x_1, y_1)$ and $I_{rgb}^{t}(x_2, y_2)$ can be computed as
$$f_{ISP}^{r}\left(\frac{\gamma^{r}}{\gamma^{t}}h_{ISP}^{t}\left(I_{rgb}^{t}(x_2, y_2)\right)\right) = I_{rgb}^{r}(x_1, y_1),$$
where $h_{ISP}^{t}$ is the inverse operation of $f_{ISP}^{t}$.

Fig. 3. Color correction process. (a) Bayer pattern and demosaic step, (b) color extraction from 24-patch MacBeth color checker, and (c) the proposed color correction model.

With the rapid development of deep learning, learning-based methods have been proposed to replace the ISP process with a $3\times 3$ transformation matrix estimated by a designed CNN [32,33]. Inspired by their success, a learning-based approach with an MLP is presented to approximate the complicated ISP process with a $3\times 3$ CCM, and Eq. (7) is rewritten as

$$M_{3\times 3}I_{rgb}^{t}(x_2, y_2) = I_{rgb}^{r}(x_1, y_1),$$
where $M_{3\times 3}$ is the estimated CCM.

Given a standard 24-patch MacBeth color checker in the 3D scene, the pixel correspondences among the cameras of the multi-camera array are easily extracted, as shown in Fig. 3(b). After that, the mean color of each patch is calculated as an input point, and $M_{3\times 3}$ is output by an MLP consisting of five fully connected layers, as shown in Fig. 3(c). Finally, the input points are multiplied by $M_{3\times 3}$ to generate the color-calibrated points, and the $L_1$ loss function is used to learn the CCM as

$$L_{color}=\left|\left|M_{3\times 3}P_{input} - P_{ref}\right|\right|_1,$$
where $P_{input}$ and $P_{ref}$ are the input and reference points, respectively.
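A minimal PyTorch sketch of the color-calibration model in Fig. 3(c) and Eqs. (8)–(9) is given below: a five-layer MLP regresses the nine entries of $M_{3\times 3}$ from the 72 (24 $\times$ 3) color features, and the CCM is optimized with the $L_1$ loss. The layer widths and the exact input encoding are assumptions consistent with Table 1, not the released implementation.

```python
import torch
import torch.nn as nn

class CCMNet(nn.Module):
    """Five fully connected layers that regress a 3x3 color correction matrix
    from the 24x3 mean colors of the MacBeth checker (Eq. (8))."""
    def __init__(self, in_dim=72, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 9),
        )

    def forward(self, patch_means):                   # patch_means: (B, 72)
        return self.mlp(patch_means).view(-1, 3, 3)   # (B, 3, 3) CCM

def color_loss(M, p_input, p_ref):
    """Eq. (9): L1 distance between CCM-mapped test colors and reference colors.
    p_input, p_ref: (B, 24, 3) RGB points of the test and reference views."""
    p_cal = torch.einsum('bij,bkj->bki', M, p_input)  # apply M to each color point
    return (p_cal - p_ref).abs().mean()

# At test time, the learned (3, 3) matrix M is applied per pixel, e.g. for an
# (H, W, 3) image tensor img: calibrated = img @ M.T
```

Because the CCM is estimated once per camera under fixed illumination, calibration at run time reduces to a single matrix multiplication per pixel, which suits the real-time pipeline.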

2.3 Homography transformation

After color calibration, the sparse images have consistent color. However, since the multi-camera array cannot be guaranteed to be horizontally aligned, it is important to eliminate the vertical disparity before performing virtual view synthesis. In this paper, a chessboard is used to provide four corner points in different images, and a homography matrix can be calculated by solving

$$\left(\mathbf{c}_{t_{1}}, \mathbf{c}_{t_{2}}, \mathbf{c}_{t_{3}}, \mathbf{c}_{t_{4}}\right) = H_{3\times 3}\left(\mathbf{c}_{i_1}, \mathbf{c}_{i_2}, \mathbf{c}_{i_3}, \mathbf{c}_{i_4}\right),$$
where $\mathbf {c}_t = (x_t, y_t, 1)^T$ and $\mathbf {c}_i = (x_i, y_i, 1)^T$ represent the corner points of target image and source image, respectively, and $H_{3\times 3}$ denotes the homography matrix. With solved $H_{3\times 3}$, the sparse images can be aligned horizontally.
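As a sketch of Eq. (10), the homography can be estimated from the four chessboard corners with OpenCV and then used to warp a source view into alignment with the target view; the corner coordinates and the file name below are placeholders.

```python
import cv2
import numpy as np

# Four matching chessboard corners (placeholder coordinates) in source and target views.
src_pts = np.float32([[100, 120], [900, 118], [905, 620], [98, 625]])
tgt_pts = np.float32([[102, 115], [902, 115], [902, 615], [102, 615]])

# Solve Eq. (10) for H (exactly determined by four correspondences).
H, _ = cv2.findHomography(src_pts, tgt_pts)

# Warp the source view so that its rows align with the target view.
src_img = cv2.imread('source_view.png')   # placeholder file name
aligned = cv2.warpPerspective(src_img, H, (src_img.shape[1], src_img.shape[0]))
```

Like the CCM, the homography is computed once offline and then applied to every frame, so only a per-pixel warp remains in the real-time pipeline.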

2.4 Optical flow estimation based on self-supervised learning

In order to generate virtual views in real time, a practical scheme is proposed: the disparity is estimated once, as fast as possible, and multiple virtual views are then generated with the backward warp operation, which takes little time and can be executed many times. In practice, since the multi-camera array cannot provide ground truth to supervise the learning process in real-world scenes, self-supervised learning is adopted to complete this task.

A lightweight self-supervised learning approach is developed to generate accurate optical flow, as shown in Fig. 4. Given a pair of images from the left and right views, the corresponding optical flow is output by the designed CNN. In the training stage, the optical flow, combined with the left/right view, is used to synthesize the right/left view by the backward warp operation, as shown in Fig. 4(a), and the loss between the synthetic view and the target view is used to optimize the CNN. In the test stage, the optical flow, together with a specific position factor, generates the corresponding virtual views using the backward warp operation, as shown in Fig. 4(b).

Fig. 4. The diagram of the proposed self-supervised learning approach. (a) The training stage, and (b) the predicting stage.

Although the proposed network, with a small number of parameters and low computation, facilitates real-time view synthesis, it is hard to guarantee robustness to different scenes. Thus, a fine-tuning strategy is adopted to improve the generalization of the network. Specifically, for each new scene, the network is first trained for about 300 epochs on dynamic images from that scene, which takes around 10 minutes. The fine-tuned network fits the scene well, and high-quality optical flow can be estimated to generate dense views.

2.5 Self-supervised network architecture

The detailed network architecture is illustrated in Fig. 5(a). The network is a U-net structure with symmetric encoder and decoder modules. To estimate the optical flow between the left view $V_l \in \mathbb {R} ^ {H \times W \times 3}$ and the right view $V_r \in \mathbb {R} ^ {H \times W \times 3}$, the two views are first concatenated and then input to the encoder module for multi-scale feature extraction as

$$F_i = EB_i\left(\cdots EB_1\left(\left[V_l, V_r\right]\right)_\downarrow\right)_\downarrow,$$
where $H$ and $W$ represent the height and width of the input image respectively, $\downarrow$ is the max pooling operation to down-sample the output feature, $F_i \in \mathbb {R} ^ {\left (H / 2^{i}\right ) \times \left (W / 2^{i}\right ) \times C}$ is the $i$-th feature with feature depth $C$, $i=1,2,\ldots,5$, $[V_l, V_r]$ is the concatenated left and right views, and $EB_i$ represents the $i$-th encoder block, whose structure is shown in Fig. 5(c).

Fig. 5. The self-supervised network architecture. (a) The schematic diagram of the network, (b) the backward warp operation to synthesize the novel view, and the details of the (c) encoder block, (d) decoder block, and (e) flow block.

Similarly, the decoder module reconstructs the feature of optical flow using skip connection as

$$H_j = DB_{j+1}\left(\cdots \left(DB_1\left(F_5\right)_\uparrow+F_4\right) \cdots \right)_\uparrow,$$
where $\uparrow$ is the bilinear interpolation operation to up-scale the feature, $H_j \in \mathbb {R} ^ {\left (H / 2^{4-j}\right ) \times \left (W / 2^{4-j}\right ) \times C}$ is the $j$-th reconstructed feature, $j = 1,\ldots,4$, $DB$ is the decoder block, whose structure is shown in Fig. 5(d). Note that, to learn the characteristics from features with different scales, four optical flows in different sizes are estimated using a flow block as
$$D_j = FB_{j}\left(H_j\right),$$
where $D_j \in \mathbb {R} ^ {\left (H / 2^{4-j}\right ) \times \left (W / 2^{4-j}\right ) \times 2}$ represents the $j$-th optical flow, and $FB_j$ is the flow block, as shown in Fig. 5(e).
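The sketch below mirrors the structure described by Eqs. (11)–(13): a five-level encoder on the concatenated stereo pair, a decoder with skip connections, and flow heads at four scales. The block internals (two 3 × 3 convolutions with ReLU per block) and the exact feature resolutions are assumptions consistent with Fig. 5, not the exact published layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_c, out_c):
    """Assumed encoder/decoder block: two 3x3 convolutions with ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_c, out_c, 3, padding=1), nn.ReLU(inplace=True),
    )

class FlowUNet(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.encoders = nn.ModuleList(
            [conv_block(6, channels)] + [conv_block(channels, channels) for _ in range(4)]
        )
        self.decoders = nn.ModuleList([conv_block(channels, channels) for _ in range(4)])
        # Flow heads: one 3x3 convolution mapping features to a 2-channel flow (Eq. (13)).
        self.flow_heads = nn.ModuleList([nn.Conv2d(channels, 2, 3, padding=1) for _ in range(4)])

    def forward(self, v_l, v_r):
        x = torch.cat([v_l, v_r], dim=1)          # [V_l, V_r], Eq. (11)
        feats = []
        for enc in self.encoders:                 # five encoder blocks, each followed by pooling
            x = F.max_pool2d(enc(x), 2)
            feats.append(x)                       # F_1 ... F_5
        h = feats[-1]                             # deepest feature F_5
        flows = []
        for i, (dec, head) in enumerate(zip(self.decoders, self.flow_heads)):
            h = F.interpolate(dec(h), scale_factor=2, mode='bilinear', align_corners=False)
            h = h + feats[-2 - i]                 # skip connection with F_4, F_3, F_2, F_1 (Eq. (12))
            flows.append(head(h))                 # multi-scale flows D_1 ... D_4
        return flows                              # coarse to fine; D_4 can be upsampled to image size
```

The small channel count (32) and shallow blocks are what keep the parameter count and FLOPs low, as quantified in Table 3.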

With the output $D_j$, the synthetic view can be generated based on backward warp operation, as shown in Fig. 5(b). Specifically, in the training stage, to optimize the network with the proposed self-supervised scheme, the left views with different scales are firstly generated with $D_j$ and the corresponding right views as

$$V_l^{j'}(x, y) = V_r^j(x + D_j^h(x, y), y),$$
where $x$ and $y$ are the coordinates of the left view, and $V_l^{j'}$ and $V_r^j$ are the synthetic left view and the input right view, respectively, whose width and height match those of $D_j$ after down-sampling. Note that the homography transformation has already been applied to the sparse views to eliminate the vertical disparity; thus, only the horizontal component of $D_j$ is used in the backward warp operation. Likewise, in the test stage, the virtual view can be synthesized with an additional position factor as
$$V_{novel}(x, y) = V_r(x + \omega D_4^h(x, y), y),$$
where $V_{novel}$ is the novel view, and $\omega \in (0, 1)$ is the position factor of the virtual view.
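A minimal sketch of the backward warp in Eqs. (14)–(15) using `torch.nn.functional.grid_sample` is given below; only the horizontal flow component is used, and the position factor ω selects the virtual view position. The normalization to the [-1, 1] grid convention and the bilinear/border sampling options are implementation details assumed here.

```python
import torch
import torch.nn.functional as F

def backward_warp_horizontal(v_r, flow_h, omega=1.0):
    """Synthesize a view from the right image and the horizontal flow (Eqs. (14)-(15)).
    v_r: (B, 3, H, W) right view; flow_h: (B, 1, H, W) horizontal flow in pixels;
    omega in (0, 1] picks the virtual view position (1.0 reproduces the left view)."""
    b, _, h, w = v_r.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=v_r.device),
                            torch.arange(w, device=v_r.device), indexing='ij')
    xs = xs.float().unsqueeze(0) + omega * flow_h[:, 0]      # x + omega * D^h(x, y)
    ys = ys.float().unsqueeze(0).expand_as(xs)
    # Normalize to [-1, 1] for grid_sample.
    grid = torch.stack([2.0 * xs / (w - 1) - 1.0,
                        2.0 * ys / (h - 1) - 1.0], dim=-1)   # (B, H, W, 2)
    return F.grid_sample(v_r, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)
```

Because only the sampling grid depends on ω, one flow estimation can be reused for many virtual view positions, which is the key to the real-time dense-view generation.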

Furthermore, to pre-train the proposed network effectively, the Flickr1024 dataset is used as a training dataset, which involves 1024 pairs of binocular images [34]. In addition, the loss function of the network is defined as

$$L_{flow} = \frac{1}{4}\sum_{j=1}^{4}\alpha_j\left|\left|V_l^{j'} - V_l^j\right|\right|_1,$$
where $\alpha_j \in \left \{0.4, 0.6, 0.8, 1.0\right \}$ for $j=1, 2, 3, 4$.
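The self-supervised loss in Eq. (16) can be sketched as follows, reusing the warp sketch above: each multi-scale flow warps a resized right view toward the left view, and the weighted L1 photometric error supervises the network without labels. Resizing the stereo pair by area interpolation is an assumed detail.

```python
import torch.nn.functional as F

def flow_loss(flows, v_l, v_r, weights=(0.4, 0.6, 0.8, 1.0)):
    """Eq. (16): weighted multi-scale L1 photometric loss.
    flows: list of 4 flow maps, coarse to fine; v_l, v_r: (B, 3, H, W) stereo pair."""
    total = 0.0
    for w_j, d_j in zip(weights, flows):
        size = d_j.shape[-2:]
        v_l_j = F.interpolate(v_l, size=size, mode='area')      # down-sampled left view
        v_r_j = F.interpolate(v_r, size=size, mode='area')      # down-sampled right view
        v_l_syn = backward_warp_horizontal(v_r_j, d_j[:, :1])   # Eq. (14), horizontal flow only
        total = total + w_j * (v_l_syn - v_l_j).abs().mean()
    return total / len(flows)
```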

3. Implementations

3.1 Experimental setup

To implement the training process effectively, different experimental setups are employed for the color calibration and optical flow estimation models, as shown in Table 1. For the color calibration model, 72 (24 $\times$ 3) color features are input to learn a 3 $\times$ 3 color correction matrix. The number of channels in the model is 256, and the batch size is 10. The training is stopped after 100 epochs with a fixed learning rate of 1 $\times 10^{-3}$. For the optical flow estimation model, the input images are first reshaped to $256 \times 512 \times 3$ and then used to learn the optical flow. The number of channels in this model is 32, and the batch size is 4. This network is trained for 300 epochs with a fixed learning rate of $1\times 10^{-4}$. Note that for both models, the parameters are initialized with the Kaiming method [35], and the Adam optimizer is used [36].
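A hedged sketch of the flow-model training setup in Table 1 is shown below: Kaiming initialization [35], the Adam optimizer [36], a fixed learning rate of 1e-4, and 300 epochs. The names `FlowUNet`, `flow_loss`, and `train_loader` refer to the sketches above and an assumed data loader of stereo pairs, not released code.

```python
import torch
import torch.nn as nn

def kaiming_init(module):
    """Kaiming initialization for convolutional layers [35]."""
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = FlowUNet(channels=32).cuda()
model.apply(kaiming_init)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # fixed learning rate, Table 1

for epoch in range(300):                                     # 300 epochs for the flow model
    for v_l, v_r in train_loader:                            # assumed loader of (left, right) pairs
        v_l, v_r = v_l.cuda(), v_r.cuda()
        loss = flow_loss(model(v_l, v_r), v_l, v_r)          # self-supervised loss, Eq. (16)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```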

Table 1. The experimental setup of color calibration and optical flow estimation models.

3.2 Multi-camera array and 3D light-field display device

To capture the sparse input views, a multi-camera array of 11 Blackmagic Micro Cinema Cameras is used as the capture system, as shown in Fig. 6. The coverage of the multi-camera array is about $60^{\circ }$, which provides a large viewing angle for the 3D light-field display. For the display device, the innovative 27-inch 3D light-field display with $3840 \times 2160$ resolution is used to present the dense-view results, as shown in Fig. 7(a), and the parameters of the display device are listed in Table 2. Moreover, the sub-pixel arrangement principle of the 3D light-field display is illustrated in Fig. 7(b). Given the dense-view images, the view number of each sub-pixel position $(i, j, k)$ can be calculated as [10]

$$N(i, j, k) = \left(\frac{3j+k-3i\tan\alpha}{L}-floor\left(\frac{3j+k-3i\tan\alpha}{L}\right)\right)\times N,$$
where $\alpha$ is the inclined angle of the lenticular-lens array, $N$ is the number of views, and $L=p/\left (p_w\cos \alpha \right )$ represents the number of sub-pixels covered by a lenticular lens, where $p$ is the lenticular-lens pitch and $p_w$ is the width of the sub-pixel pitch.
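A direct transcription of Eq. (17) for mapping a sub-pixel to its view index is given below; the display parameters (inclination angle, pitches, number of views) are placeholders to be replaced by the values in Table 2.

```python
import numpy as np

def view_index(i, j, k, alpha, pitch, sub_pixel_width, num_views):
    """Eq. (17): view number assigned to sub-pixel (i, j, k) of the encoding image.
    i: row, j: column, k: sub-pixel index (0, 1, 2 for R, G, B)."""
    L = pitch / (sub_pixel_width * np.cos(alpha))      # sub-pixels covered per lenticule
    phase = (3 * j + k - 3 * i * np.tan(alpha)) / L
    return (phase - np.floor(phase)) * num_views

# Placeholder parameters (not the values of Table 2).
alpha = np.deg2rad(9.0)
N = view_index(i=0, j=0, k=1, alpha=alpha, pitch=0.5,
               sub_pixel_width=0.05, num_views=60)
```

In the real pipeline this mapping is evaluated for the whole 3840 × 2160 sub-pixel grid on the GPU, so the per-sub-pixel arithmetic above is vectorized rather than looped.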

Fig. 6. The structure of multi-camera array.

Fig. 7. The structure of 3D light-field display. (a) The components of the 3D light-field display, and (b) the sub-pixel arrangement principle of 3D light-field display.

Table 2. The parameters of 3D light-field display.

3.3 Real-time 3D light-field display system pipeline

To achieve a real-time 3D light-field display, GPU-based parallel computation is indispensable for dense-view generation, as shown in Fig. 8. In the experiments, the PC hardware configuration includes an Intel Core i7-11700K CPU @ 3.6 GHz with 32 GB RAM and an NVIDIA RTX 3090 GPU. The CPU performs parameter and image loading, and the GPU performs data pre-processing (color calibration and homography transformation), optical flow estimation, backward warping, and sub-pixel arrangement. Implemented with the LibTorch deep learning framework [37], the CUDA toolkit, and OpenGL in C++, the parallel computing system significantly accelerates the generation process and enables the real-time 3D light-field display.
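The per-frame pipeline of Fig. 8 can be summarized by the following Python/PyTorch stand-in for the C++/LibTorch implementation; `warp_homography` and `encode_light_field` are assumed helpers, and the choice of the first flow channel as the horizontal component follows the sketches above rather than the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def render_frame(views, ccms, homographies, flow_net, omegas):
    """One frame: color calibration -> homography alignment -> flow -> backward warp -> encoding.
    views: list of (3, H, W) GPU tensors from the 11 cameras; ccms: per-camera 3x3 CCMs."""
    # 1) Color calibration with the pre-estimated CCMs (Eq. (8)).
    views = [torch.einsum('ij,jhw->ihw', M, v) for M, v in zip(ccms, views)]
    # 2) Horizontal alignment with the pre-computed homographies (Eq. (10), assumed helper).
    views = [warp_homography(v, H) for v, H in zip(views, homographies)]
    # 3) One flow estimation per neighboring pair, then several warps per pair.
    dense = []
    for v_l, v_r in zip(views[:-1], views[1:]):
        flow = flow_net(v_l[None], v_r[None])[-1]            # finest flow map
        flow = F.interpolate(flow, size=v_r.shape[-2:],
                             mode='bilinear', align_corners=False)
        dense += [backward_warp_horizontal(v_r[None], flow[:, :1], w)[0] for w in omegas]
    # 4) Sub-pixel arrangement into the 3840x2160 encoding image (Eq. (17), assumed helper).
    return encode_light_field(dense)
```

With 10 neighboring pairs and six position factors per pair, this structure yields about 60 dense views per frame, consistent with the configuration reported in Section 4.4.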

Fig. 8. The system pipeline of the proposed scheme.

4. Experimental results

This section first analyzes the performance of the proposed color calibration approach and then evaluates the image quality and computational performance of the developed self-supervised optical flow estimation scheme. Finally, the synthetic dense views are presented on the 3D light-field display to further demonstrate the effectiveness of the proposed method.

4.1 Color calibration performance evaluation

To evaluate the proposed color calibration approach, three traditional color calibration methods, HHM [24], CCM [26], and R [22], are compared with our method. Since labels after color calibration are not available in practice, identical features in all views are employed for the quantitative analysis.

In our experiments, the 24 color points of the MacBeth color checker are used as identical features to compare the calibrated results with the reference points, and the mean absolute error (MAE) between them is used to evaluate the color calibration performance, which is calculated as

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|Ref_i - Pred_i\right|,$$
where $N = 24$ is the number of color points, $Ref_i$ is the reference point, and $Pred_i$ is the calibrated point. The MAE results are shown in Fig. 9. It can be seen that our method is superior to the other methods in both mean and standard deviation. Note that the HHM method gives a better result than ours on the central views. However, since this method relies heavily on feature matching, its performance degrades severely when the side views cannot provide sufficient matching features. In addition, the qualitative results in Fig. 10 further verify the effectiveness of our method. Thanks to the learning-based color correction model, our method produces satisfying visual results.

Fig. 9. Quantitative evaluation of different methods on color calibration performance.

Fig. 10. Visualized results about color calibration using different methods. (a) The results of 24-patch MacBeth color checker, and (b) the results of human body.

4.2 Image quality evaluation of synthetic views

To validate the synthetic image quality of our proposed method, two supervised optical flow estimation methods with high computational performance, PWC-Net [15] and LiteFlowNet3 [14], are compared with our method. Since ground truth for the virtual views is unavailable, the left view synthesized from the estimated optical flow and the right view is used to compute the quantitative metrics (PSNR and SSIM), as shown in Figs. 11 and 12. Note that, for a fair comparison, PWC-Net and LiteFlowNet3 are fine-tuned on our datasets in a self-supervised manner similar to ours, denoted as PWC-Net-s and LiteFlowNet3-s in Figs. 11 and 12, respectively.

Fig. 11. Visualized results of synthetic left view using different methods. The first row of each sub-figure is the synthetic results, the second row is the EPI results, and the third row is the residual error maps. (a) The human body 1, (b) the human body 2, and (c) the Dog dataset in [38].

Fig. 12. (a) Image quality comparison of different methods. (b) The image quality and computational performance comparison with different feature channels. (c) The image quality and computational performance comparison with different feature layers.

In our experiments, three datasets are used to verify the reconstructed results, as shown in Fig. 11. The first two datasets in Figs. 11(a) and (b) were collected by our multi-camera array, and the last dataset was obtained from the Fujii Laboratory at Nagoya University [38] to further evaluate the effectiveness of our method. For qualitative comparison, the generated left view, the residual error map, and the epipolar plane image (EPI) of the dense views are shown in Fig. 11. Our method achieves better image quality than the other methods, and its EPI result is also visually superior, which demonstrates the reconstruction quality of our method. For quantitative comparison, the average PSNR over the three datasets between the synthetic left views and the reference left views is calculated and illustrated in Fig. 12(a). Our method achieves a PSNR of around 30 dB across different view positions. These experiments demonstrate that the proposed method provides satisfying image quality for 3D light-field display.

Moreover, to validate the effectiveness of the proposed color calibration approach for view synthesis, the original images without color calibration are used to estimate the optical flow and generate novel views. The visualized results are shown in the last column of Fig. 11, and the PSNR result is shown in Fig. 12(a). Note that since the dataset in Fig. 11(c) does not provide a corresponding color checker, only the first two datasets are used to verify the color calibration, as shown in Figs. 11(a) and (b). It can be seen that when the colors of the left and right views are inconsistent, the optical flow is estimated incorrectly and the reconstructed results are inferior. In addition, the quantitative result in Fig. 12(a) shows that the PSNR of the synthetic image is strongly influenced by the view position, yielding inconsistent results. Therefore, color calibration is essential for high-quality view synthesis in practice.

4.3 Computational performance evaluation

In this part, the computational performance of the proposed method is analyzed. First, the number of parameters (#Params.) and the floating point operations (FLOPs) of the different models are listed in Table 3. Although PWC-Net and LiteFlowNet3 present promising computational performance, their numbers of parameters and FLOPs are still larger than those of our model, whose parameters and FLOPs are less than 0.3M and 20G, respectively. Moreover, the computational speed of the different methods is illustrated in Fig. 13 from two aspects.

Fig. 13. Computational performance comparisons of different methods using different (a) image resolution, and (b) batch size.

Table 3. Quantitative comparisons of the number of parameters and the FLOPs for different optical flow estimation methods. The FLOPs are calculated using $10 \times 6 \times 1024 \times 512$ images as input.

On the one hand, 10 color image pairs are input to the networks at different resolutions, and the results are shown in Fig. 13(a). For all resolutions, the computational time of LiteFlowNet3 exceeds 50 ms. Likewise, as the image resolution increases, the computational time of PWC-Net rapidly rises above 50 ms. Thus, these two methods cannot meet the real-time requirement above $512 \times 256$ resolution. In contrast, our method achieves real-time computation at most resolutions, with a computational time of less than 50 ms.

On the other hand, different batch sizes of images at $1024 \times 512$ resolution are input to the networks, and the result is shown in Fig. 13(b). Similarly, our method maintains a low computational time across different batch sizes, while the other two methods take much more time to estimate the optical flow. Consequently, our method offers superior computational speed compared with the other two methods and achieves real-time view synthesis.

Moreover, to validate the architecture of the proposed network, two variants of the network are designed to evaluate both image quality and computational performance. To evaluate the effect of the feature channels, different numbers of feature channels are used in our network, and the result is shown in Fig. 12(b). As the number of feature channels increases, the computational time increases significantly, but the PSNR metric fluctuates around 30 dB. Hence, a large number of feature channels is not essential, and 32 feature channels are used in our network. On the other hand, the depth of the multi-scale feature layers in the U-net architecture is another critical factor that influences image quality and computational performance. To verify its effect, five different depths from 3 to 7 layers are used in the experiment, and the result is shown in Fig. 12(c). As seen from the figure, both the PSNR and the computational time increase with the number of layers. However, compared with the improvement in PSNR, the increase in computational time impairs the real-time performance, especially when the number of layers is larger than 6. Thus, 5 feature layers are employed in our network to balance image quality and computational performance.

4.4 Presentation on the 3D light-field display

To present a high-quality 3D light-field display in real time, 60 virtual views are generated from 11 input images at $1024 \times 512$ resolution. After that, the 4K ($3840 \times 2160$) encoding image is synthesized based on Eq. (17) for the 3D light-field display. Since the encoding process is performed in the GPU pipeline, its computational time is negligible. Finally, different models are presented on our innovative 27-inch 3D light-field display device at over 25 fps, as shown in Fig. 14. The synthetic views present a high-quality imaging effect with smooth and correct parallax. Consequently, the experimental results demonstrate the effectiveness of our method for real-time 3D light-field display.

Fig. 14. The decoding images and the results of 3D light-field display from different perspectives (see Visualization 1).

5. Conclusion

In summary, a real-time imaging method is proposed to achieve a 3D light-field display of real-world scenes. First, a learning-based color calibration method is proposed to correct the image colors of different views. Using the matching points provided by the color checker, a $3 \times 3$ color correction matrix is learned by the MLP network to perform color calibration. After that, a lightweight CNN is proposed to estimate the optical flow for view synthesis. Since no labels exist in the experiments, a self-supervised learning scheme is introduced to estimate the optical flow. Experimental results show that the proposed method can synthesize 60 views at $1024 \times 512$ resolution at over 20 fps, and a real-time 3D light-field display at $3840 \times 2160$ resolution is achieved. These results suggest that our method enables real-time dense-view 3D light-field display of real-world scenes. It is believed that the proposed method is helpful for the development of real-time, real-world 3D light-field displays.

Funding

National Natural Science Foundation of China (61905017, 61905020, 62075016, 62175017).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. X. Sang, X. Gao, X. Yu, S. Xing, Y. Li, and Y. Wu, “Interactive floating full-parallax digital three-dimensional light-field display based on wavefront recomposing,” Opt. Express 26(7), 8883–8889 (2018). [CrossRef]  

2. X. Yu, X. Sang, S. Xing, T. Zhao, D. Chen, Y. Cai, B. Yan, K. Wang, J. Yuan, and C. Yu, “Natural three-dimensional display with smooth motion parallax using active partially pixelated masks,” Opt. Commun. 313, 146–151 (2014). [CrossRef]  

3. K. Yanaka, “Integral photography using hexagonal fly’s eye lens and fractional view,” Proc. SPIE 6803, 68031K (2008). [CrossRef]  

4. H. Ren, Q. H. Wang, Y. Xing, M. Zhao, L. Luo, and H. Deng, “Super-multiview integral imaging scheme based on sparse camera array and cnn super-resolution,” Appl. Opt. 58(5), A190–A196 (2019). [CrossRef]  

5. S. Xing, X. Sang, X. Yu, D. Chen, B. Pang, X. Gao, S. Yang, Y. Guan, B. Yan, and J. Yuan, “High-efficient computer-generated integral imaging based on the backward ray-tracing technique and optical reconstruction,” Opt. Express 25(1), 330–338 (2017). [CrossRef]  

6. B. Pang, X. Sang, S. Xing, X. Yu, D. Chen, B. Yan, K. Wang, C. Yu, B. Liu, and C. Cui, “High-efficient rendering of the multi-view image for the three-dimensional display based on the backward ray-tracing technique,” Opt. Commun. 405, 306–311 (2017). [CrossRef]  

7. Y. Li, X. Sang, S. Xing, Y. Guan, S. Yang, D. Chen, L. Yang, and B. Yan, “Real-time optical 3d reconstruction based on monte carlo integration and recurrent cnns denoising with the 3d light field display,” Opt. Express 27(16), 22198–22208 (2019). [CrossRef]  

8. X. Guo, X. Sang, D. Chen, P. Wang, H. Wang, X. Liu, Y. Li, S. Xing, and B. Yan, “Real-time optical reconstruction for a three-dimensional light-field display based on path-tracing and cnn super-resolution,” Opt. Express 29(23), 37862–37876 (2021). [CrossRef]  

9. Y. Guan, X. Sang, S. Xing, Y. Li, and B. Yan, “Real-time rendering method of depth-image-based multiple reference views for integral imaging display,” IEEE Access 7, 170545–170552 (2019). [CrossRef]  

10. Y. Guan, X. Sang, S. Xing, Y. Chen, Y. Li, D. Chen, X. Yu, and B. Yan, “Parallel multi-view polygon rasterization for 3d light field display,” Opt. Express 28(23), 34406–34421 (2020). [CrossRef]  

11. Y. Li, X. Sang, S. Xing, Y. Guan, S. Yang, and B. Yan, “Real-time volume data three-dimensional display with a modified single-pass multiview rendering method,” Opt. Eng. 59(10), 102412 (2020). [CrossRef]  

12. Raytrix, “3d light-field vision,” http://www.raytrix.de/.

13. B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, “High performance imaging using large camera arrays,” ACM Trans. Graph. 24(3), 765–776 (2005). [CrossRef]  

14. T. W. Hui and C. C. Loy, “Liteflownet3: Resolving correspondence ambiguity for more accurate optical flow estimation,” in European Conference on Computer Vision, (Springer, 2020), pp. 169–184.

15. D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 8934–8943.

16. T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: Learning view synthesis using multiplane images,” arXiv preprint arXiv:1805.09817 (2018).

17. D. Chen, X. Sang, W. Peng, X. Yu, and H. Wang, “Multi-parallax views synthesis for three-dimensional light-field display using unsupervised cnn,” Opt. Express 26(21), 27585–27598 (2018). [CrossRef]  

18. D. Chen, X. Sang, P. Wang, X. Yu, B. Yan, H. Wang, M. Ning, S. Qi, and X. Ye, “Dense-view synthesis for three-dimensional light-field display based on unsupervised learning,” Opt. Express 27(17), 24624–24641 (2019). [CrossRef]  

19. D. Chen, X. Sang, P. Wang, X. Yu, X. Gao, B. Yan, H. Wang, S. Qi, and X. Ye, “Virtual view synthesis for 3d light-field display based on scene tower blending,” Opt. Express 29(5), 7866–7884 (2021). [CrossRef]  

20. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European conference on computer vision, (Springer, 2020), pp. 405–421.

21. H. S. Faridul, T. Pouli, C. Chamaret, J. Stauder, E. Reinhard, D. Kuzovkin, and A. Trémeau, “Colour mapping: A review of recent methods, extensions and applications,” in Computer Graphics Forum, vol. 35 (Wiley Online Library, 2016), pp. 59–88.

22. E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley, “Color transfer between images,” IEEE Comput. Grap. Appl. 21(4), 34–41 (2001). [CrossRef]  

23. U. Fecker, M. Barkowsky, and A. Kaup, “Histogram-based prefiltering for luminance and chrominance compensation of multiview video,” IEEE Trans. Circuits Syst. Video Technol. 18(9), 1258–1267 (2008). [CrossRef]  

24. C. Ding and Z. Ma, “Multi-camera color correction via hybrid histogram matching,” IEEE Trans. Circuits Syst. Video Technol. 31(9), 3327–3337 (2021). [CrossRef]  

25. K. Li, Q. Dai, and W. Xu, “High quality color calibration for multi-camera systems with an omnidirectional color checker,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, (IEEE, 2010), pp. 1026–1029.

26. Opencv, “Color correction model,” https://docs.opencv.org/4.x/de/df4/group__color__correction.html.

27. B. A. Wandell, “The synthesis and analysis of color images,” IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9(1), 2–13 (1987). [CrossRef]  

28. W. Shi, C. C. Loy, and X. Tang, “Deep specialized network for illuminant estimation,” in European Conference on Computer Vision, (Springer, 2016), pp. 371–387.

29. X. Yang, X. Jin, and J. Zhang, “Improved single-illumination estimation accuracy via redefining the illuminant-invariant descriptor and the grey pixels,” Opt. Express 26(22), 29055–29067 (2018). [CrossRef]  

30. S. B. Gao, M. Zhang, and Y. J. Li, “Improving color constancy by selecting suitable set of training images,” Opt. Express 27(18), 25611–25633 (2019). [CrossRef]  

31. R. M. Nguyen and M. S. Brown, “Raw image reconstruction using a self-contained srgb-jpeg image with only 64 kb overhead,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 1655–1663.

32. X. Xu, Y. Ma, and W. Sun, “Towards real scene super-resolution with raw images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 1723–1731.

33. E. Schwartz, R. Giryes, and A. M. Bronstein, “Deepisp: Toward learning an end-to-end image processing pipeline,” IEEE Trans. on Image Process. 28(2), 912–923 (2019). [CrossRef]  

34. Y. Wang, L. Wang, J. Yang, W. An, and Y. Guo, “Flickr1024: A large-scale dataset for stereo image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, (2019), pp. 3852–3857.

35. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, (2015), pp. 1026–1034.

36. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

37. Pytorch, “Install pytorch,” https://pytorch.org/.

38. T. Saito, “Nagoya university multi-view sequences download list,” http://www.fujii.nuee.nagoya-u.ac.jp/multiview-data/.

Supplementary Material (1)

Visualization 1: The results of 3D light-field display.
