
Fast virtual view synthesis for an 8K 3D light-field display based on cutoff-NeRF and 3D voxel rendering

Open Access

Abstract

Three-dimensional (3D) light-field displays can provide an immersive visual experience, which has attracted significant attention. However, generating high-quality 3D light-field content of the real world remains challenging because it is difficult to capture dense high-resolution viewpoints with a camera array. CNN-based novel view synthesis can generate dense high-resolution viewpoints from sparse inputs but suffers from high computational resource consumption, low rendering speed, and a limited camera baseline. Here, a two-stage virtual view synthesis method based on cutoff-NeRF and 3D voxel rendering is presented, which can rapidly synthesize dense novel views with smooth parallax and 3D images with a resolution of 7680 × 4320 for the 3D light-field display. In the first stage, an image-based cutoff-NeRF is proposed to implicitly represent the distribution of scene content and improve the quality of the virtual views. In the second stage, a 3D voxel-based image rendering and coding algorithm is presented, which quantifies the scene content distribution learned by cutoff-NeRF to render high-resolution virtual views quickly and output high-resolution 3D images. A coarse-to-fine 3D voxel rendering method is proposed to effectively improve the accuracy of the voxel representation, and a 3D voxel-based off-axis pixel encoding method is proposed to speed up 3D image generation. Finally, a sparse-view dataset is built to analyze the effectiveness of the proposed method. Experimental results demonstrate the method's effectiveness: it can rapidly synthesize high-resolution novel views and 3D images in real 3D scenes and physical simulation environments. The PSNR of the virtual views is about 29.75 dB, the SSIM is about 0.88, and synthesizing an 8K 3D image takes about 14.41 s. We believe that this fast high-resolution virtual viewpoint synthesis method can effectively promote the application of 3D light-field displays.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

The 3D light field display can record and reconstruct the light emitted from points on 3D objects, reconstruct the spatial characteristics of 3D scenes, and correctly represent the occlusion relationships between different objects [1,2]. 3D light fields can be widely used in education, video communication, and other fields [3] to improve learning enthusiasm and conference efficiency, thus attracting widespread attention from researchers. However, high-resolution 3D light field display of real 3D scenes is very challenging, which limits the development and application of the 3D light field display. Some methods have been proposed to obtain multi-views. Light field cameras, such as the Lytro Illum [4,5], can capture light in 4D, encoding both the position and direction of light rays striking the sensor, but the baseline is too small and the viewing angle is too narrow. For multi-camera arrays, Stanford University used 128 cameras to build a multi-view camera array to capture large-viewing-angle light field data [6]. However, it is difficult for a multi-camera array to ensure that the distance between the cameras is narrow enough, resulting in poor 3D image reconstruction, high requirements on equipment, and high consumption of computing power. Model-based rendering (MBR) uses objects' geometry and surface attributes to construct images from a given viewpoint [7]. However, acquiring a reasonable level of detail in the surface reflection model and geometry is usually challenging, which results in unrealistic synthetic viewpoint images. Image-based rendering (IBR) [8], which uses sparse viewpoint image feature information to reconstruct the content distribution of the scene and synthesize dense views, is usually used for generating 3D light field display content. However, traditional IBR algorithms need more manual operations to fill holes and reduce artifacts [9].

Deep learning can learn the inherent laws of sample data and use the learned feature information to solve many computer vision problems, such as depth estimation [10,11], 3D reconstruction [12], and virtual viewpoint synthesis [13]. Deep learning-based virtual view synthesis methods can synthesize high-quality virtual views by using the feature information and poses of sparse images, such as SynSin [14], Local Light Field Fusion (LLFF) [15,16], and NeRF [17–19]. SynSin is an end-to-end view synthesis method based on a single image, which uses a CNN to estimate the monocular camera depth and convert a latent 3D feature point cloud into target views. However, it is hard to use for multi-view viewpoint synthesis because the estimated depth scale is inconsistent for each image. LLFF uses a 3D CNN to promote each input view sample to a multiplane image (MPI) [20,21] scene representation, consisting of RGB$\alpha$ planes at regularly sampled disparities within the input view's camera frustum, and then blends the RGB$\alpha$ planes to synthesize novel views. LLFF can synthesize photorealistic novel views, but it only addresses well-sampled forward-facing scene viewpoint synthesis. NeRF is the most popular neural implicit representation method, which can synthesize novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a set of sparse input views. However, this method has a significant limitation in practice: it takes a long time to synthesize a novel view, which makes it hard to apply widely to viewpoint synthesis of real scenes. In addition, there are also virtual view synthesis methods based on scene tower blending [22,23] and position-guiding CNN [24]. Still, neither of them can quickly synthesize high-quality virtual views and 3D images.

Here, a novel virtual view synthesis method for the 8K 3D light field display based on cutoff-NeRF and 3D voxel rendering is proposed. The proposed method can synthesize high-resolution virtual views of real and physically simulated scenes and rapidly compute the 3D image for the 3D light field display. Additionally, the method can synthesize virtual views of complex scenes with large baselines and wide viewing angles. First, the poses of the sparse views are calibrated using a multi-view calibration algorithm. The camera intrinsic and extrinsic parameters are used to promote each input view sample to 3D space, obtaining the position $(x,y,z)$ and direction $(\theta,\varphi )$ at regularly sampled disparities within the input view's camera frustum. The position $(x,y,z)$ is encoded and input into the cutoff-NeRF neural network, which outputs the volume density and RGB, and the pixel values are calculated by ray integration. The proposed cutoff-NeRF network is trained end-to-end by minimizing the color mean squared error. Second, a coarse-to-fine 3D voxel scene representation is proposed to quantify the scene content distribution learned by the cutoff-NeRF network, thereby accelerating the synthesis of high-quality views. Finally, a 3D voxel-based off-axis pixel encoding method is proposed to speed up the synthesis of high-resolution 3D images. Experimental results demonstrate the validity of the proposed method: the PSNR of the virtual views is about 29.75 dB, the SSIM is about 0.88, and synthesizing an 8K 3D image takes about 14.41 s.

The schematic comparison between different methods is shown in Fig. 1. Classic deep learning-based volume reconstruction methods, such as LLFF, use a CNN to compute the MPI of each input view and combine the camera pose to render the target view from the adjacent MPIs, as shown in Fig. 1(a). LLFF assumes that the poses of the source images lie on a plane, so it can only achieve virtual viewpoint synthesis of forward-facing scenes. At the same time, LLFF needs to generate a group of virtual viewpoints before synthesizing 3D images, so its encoding is inefficient. Compared with the classical volume reconstruction methods, NeRF uses differentiable volume rendering to achieve a coordinate-based implicit representation of the scene, as shown in Fig. 1(b). NeRF can represent arbitrary topology but takes a long time to render a target image, and it also needs to generate a set of virtual views to synthesize 3D images. In contrast, the proposed method can represent arbitrary topology to synthesize high-resolution novel views and perform off-axis pixel encoding combined with 3D voxels to synthesize 3D images quickly. Our method first presents a cutoff-NeRF-based implicit representation of the scene to learn the scene distribution from sparse views, as shown in Fig. 1(c). Then a 3D voxel-based scene quantization method is proposed to quantify the scene content distribution learned by cutoff-NeRF and speed up view synthesis. Finally, by combining 3D voxels with off-axis pixel encoding, a fast method for generating 3D content is proposed for the 3D light field display. Notably, the proposed cutoff-NeRF can improve the quality of novel views. Additionally, the coarse-to-fine 3D voxel quantization method can effectively improve the quality of virtual viewpoint synthesis and the rendering speed.

Fig. 1. The schematic comparison between different methods. (a) is the LLFF method. (b) is the NeRF method. (c) is our proposed method.

2. Method

The overall architecture of the proposed algorithm is given in Fig. 2. First, the camera poses of all sparse input views are calibrated. The camera intrinsic and extrinsic parameters are used to sample 3D coordinates $(x,y,z)$ and 2D view directions $(\theta,\varphi )$ along the camera rays. The 3D coordinates of all sampling points are encoded and input into the cutoff-NeRF fully connected network to represent and learn the content distribution of the scene. Then, the coarse-to-fine 3D voxel method is used to quantify the scene content learned by the neural network to synthesize virtual views. Finally, the 3D voxel-based off-axis pixel encoding method is used to synthesize high-resolution 3D images for the 3D light field display.

Fig. 2. The overall approach of our proposed method. (a) The poses of the sparse views are calibrated using a multi-view calibration algorithm. Coordinates $(x,y,z)$ are sampled along the camera rays and encoded into the coarse MLP network. The spatial sampling points are resampled and then encoded into the cutoff network. (Blue points are the optimized sampling points.) (b) The coarse-to-fine 3D voxel representation method is proposed to quantify the scene content learned by the MLP network. (c) The 3D image is synthesized by the 3D voxel-based off-axis pixel encoding for the 3D light field display.

2.1 Cutoff-neural radiance fields

The 3D light field display presents 3D content over a large viewing angle, so the viewing angle covered by the sparse input images must be large enough. However, the MPI and scene tower blending view synthesis methods produce poor results in large-viewing-angle scenes. NeRF can represent arbitrary topologies, but it easily loses image details because of unreasonable spatial sampling: NeRF does not redistribute the uniform sampling points to the most likely scene content range in its second stage. Therefore, the cutoff-NeRF method is proposed to improve the quality of novel views, as shown in Fig. 3.

Fig. 3. According to the intrinsic and extrinsic parameters of the camera, coordinates $(x,y,z)$ are sampled along the camera ray and encoded into the coarse MLP network. The volume density $\sigma$ values predicted by the coarse network are used to calculate the spatial content distribution probability and sampling point position of the cutoff MLP. Then, the resampling points are encoded into cutoff MLP to predict the corresponding color value and volume density. Finally, the photometric loss is used as the loss function to optimize the neural network.

Cutoff-NeRF consists of the coarse MLP and cutoff MLP to synthesize novel views from sparse views with known camera poses. We input a set of N images $\{I_n\}^{N}_{n=1}$ and utilize COLMAP [25] to calculate camera poses (rotation matrix $\{R_n\}^{N}_{n=1}$ and translation vectors $\{T_n\}^{N}_{n=1}$). COLMAP can also estimate the focal length and the internal parameters $\{K_n\}^{N}_{n=1}$ of the cameras. $\{R_n\}^{N}_{n=1}$ and $\{T_n\}^{N}_{n=1}$ are used to promote each input image sample to 3D space, getting the position $(x,y,z)$ and direction $\mathscr {D}$ at regularly sampled disparities within the input images’ camera frustum. The sampling equation is expressed as:

$$\begin{array}{c} \mathscr{D} =[\dfrac{w_n^i - {K[0][2]}_n^N}{{K[0][0]}_n^N}, \dfrac{- h_n^j + {K[1][2]}_n^N}{{K[1][1]}_n^N},-1]^T \\ (x,y,z)^T = O_n^N + \mathscr{D} * [d_c * (1.-\dfrac{1}{N_s}) + \dfrac{d_f}{N_s}] \end{array}$$
where $w_n^i$ represents the i-th column of the n-th image, and $h_n^j$ represents the j-th row of the n-th image. $O_n^N$ is the world coordinate of the n-th camera center, $N_s$ is the number of sampling points, $d_c$ is the minimum depth of the scene, and $d_f$ is the maximum depth of the scene. To effectively quantify cutoff-NeRF into the 3D voxel space, only the encoded coordinates are input into the MLP network. The coarse network weights $w$ are optimized to map the coordinate $(x,y,z)$ to the volume density ${\sigma }_{c}$ and RGB value $c$. The per-sample rendering weights $w_{c}$ are then used to calculate the value range of the uniform sampling points in cutoff-NeRF, as the following expressions:
$$\begin{array}{rl} \mathscr{W} = \{w_1, w_2,\ldots, w_{N_s}\}& , \qquad w_{m} =\sum\limits_{j=1}^{m}{\frac{w_c^j}{\sum\nolimits_{i=1}^{N_s}w_c^i}}\\ F_n^i = \frac{F_{id}^i}{N_s} + near & , \qquad F_{id} = searchIndex(\mathscr{W}, th_f) \end{array}$$
where $\mathscr {W}$ is the normalized set of accumulated weights for each ray, $th_f$ is the weight minimum threshold. $searchIndex$ is used to calculate the position $F_{id}$ of the threshold $th_f$ in $\mathscr {W}$. $near$ is the minimum value of coarse network ray sampling, and $F_n^i$ is the minimum value of the i-th ray sampled in cutoff MLP. According to the above method, the maximum value $B_n^i$ of the i-th ray sampling point can also be calculated. Then, the cutoff MLP uniform sampling points can be calculated as the following expression:
$$\begin{array}{r} z_u = \dfrac{j}{N_r} \{j = 0, 1, 2,\ldots, N_r-1\} \\ z_{sample}^i= F_n^i * (1 - z_u) + B_n^i * z_u \end{array}$$
where $N_r$ is the number of cutoff MLP network samples, $z_{sample}^i$ is the i-th ray sampling points in the cutoff MLP. The above sampling method and the original NeRF important sampling position calculation method are combined to predict the color information. The virtual view is synthesized by using the classical volume rendering method. The volume rendering formula of the synthesized view is as follows:
$$\hat{C}(r) = \sum\limits_{i=1}^{N_s}w_i c_i, \qquad w_i = T_i(1-exp(-{\sigma_i}{t_i}))$$
where $T_i = \exp (-\sum\nolimits _{j=1}^{i-1}{\sigma _j}{t_j})$ denotes the accumulated transmittance along the ray up to the $i$-th sample, and $t_i$ is the distance between adjacent sampling points. The photometric loss is used as the loss function to optimize the neural network, as the following expression:
$$L = \sum\limits_{r \in R(p)} ||\hat{C_c}(r)-C(r)||_{2}^{2} + ||\hat{C_r}(r)-C(r)||_{2}^{2}$$
where $\hat {C_c}(r)$ is the pixel value predicted by the coarse network, $\hat {C_r}(r)$ is the pixel value predicted by the cutoff network, and $C(r)$ is the true pixel value. Then, the coarse-to-fine MLP network is trained by back-propagation to learn the content distribution of the scene.
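As a rough illustration, the following PyTorch sketch shows how the accumulated coarse weights of Eq. (2) can be searched for per-ray cutoff bounds and turned into the uniform samples of Eq. (3). The function name, the thresholds th_f and th_b, and the tensor layout are illustrative assumptions, not the authors' implementation; the depth mapping uses a per-ray step of (far − near)/N_s, which generalizes the normalized form of Eq. (2).

```python
import torch

def cutoff_resample(weights_coarse, near, far, n_refine, th_f=0.01, th_b=0.99):
    """Sketch of the cutoff resampling step (Eqs. (2)-(3)).

    weights_coarse: (n_rays, n_coarse) per-sample weights from the coarse MLP.
    near, far:      scalar depth bounds used by the coarse sampling.
    n_refine:       number of uniform samples for the cutoff MLP (N_r).
    th_f, th_b:     lower/upper thresholds on the accumulated weights
                    (illustrative values, not from the paper).
    """
    n_coarse = weights_coarse.shape[-1]
    # Normalized cumulative weight along each ray (the set W in Eq. (2)).
    w_norm = weights_coarse / (weights_coarse.sum(-1, keepdim=True) + 1e-8)
    w_cum = torch.cumsum(w_norm, dim=-1)

    # searchIndex: first index where the accumulated weight crosses each threshold.
    f_id = torch.searchsorted(w_cum, torch.full_like(w_cum[..., :1], th_f))
    b_id = torch.searchsorted(w_cum, torch.full_like(w_cum[..., :1], th_b))

    # Map indices back to depth: per-ray near/far bounds for the cutoff MLP.
    step = (far - near) / n_coarse
    f_n = near + f_id.float() * step            # F_n^i in Eq. (2)
    b_n = near + (b_id.float() + 1.0) * step    # B_n^i, the upper bound

    # Uniform samples restricted to [F_n, B_n] (Eq. (3)).
    z_u = torch.linspace(0.0, 1.0 - 1.0 / n_refine, n_refine,
                         device=weights_coarse.device)
    z_sample = f_n * (1.0 - z_u) + b_n * z_u
    return z_sample   # (n_rays, n_refine) depths fed to the cutoff MLP
```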

2.2 Scene quantitative representation based on 3D voxel

According to the approach introduced in Sec. 2.1, the scene is implicitly represented by cutoff-NeRF. The cutoff-NeRF network has $1.4690\times 10 ^{26}$ parameters, so the rendering speed is slow. To improve the rendering speed and synthesize high-quality virtual views, we propose a coarse-to-fine 3D voxel scene representation method to quantify the cutoff-NeRF network.

The schematic diagram of the proposed 3D voxel space construction process is shown in Fig. 4. Since cutoff-NeRF is a coarse-to-fine two-stage network, the coarse network provides a probable distribution of scene content for the cutoff network, so a single voxel space cannot efficiently quantify cutoff-NeRF. The coarse-to-fine 3D voxel space is presented to quantify the scene accurately. First, the voxel space's starting position $(X_d, Y_d, Z_d)$ is determined according to the camera poses and scene boundary information. Then, the sizes $(l^c_{x}, l^c_{y}, l^c_{z})$ and $(l^r_{x}, l^r_{y}, l^r_{z})$ of each voxel in the coarse voxel space and the refined voxel space are determined to store the corresponding content of cutoff-NeRF. Considering that the coarse network only provides the probable distribution of the scene content for the cutoff network, each vertex in the coarse voxel space only stores the corresponding volume density value in cutoff-NeRF. In this way, the coarse voxel space can effectively quantify the coarse network content and reduce video memory occupancy. Finally, each vertex in the refined voxel space stores the corresponding RGB and volume density values from the cutoff network. The expression of the coarse-to-fine voxel space quantization is as follows:

$$\begin{array}{c} V_c(x_c,y_c,z_c) = {\mathscr {F}_{\mathscr{c}}}(x_c,y_c,z_c), (x_c,y_c,z_c) = (X_d+l^c_x*i_c, Y_d+l^c_y*j_c, Z_d+l^c_z*k_c)\\ V_r(x_r,y_r,z_r) = {\mathscr {F}_{\mathscr{r}}}(x_r,y_r,z_r), (x_r,y_r,z_r) = (X_d+l^r_x*i_r, Y_d+l^r_y*j_r, Z_d+l^r_z*k_r)\\ (i_c,j_c,k_c) \in [N_{cx}, N_{cy}, N_{cz}],(i_r,j_r,k_r) \in [N_{rx}, N_{ry}, N_{rz}] \qquad\qquad\qquad \end{array}$$
where ${\mathscr {F}_{\mathscr{c}}}$ is the coarse MLP network, ${\mathscr {F}_{\mathscr{r}}}$ is the refined MLP network, $[N_{cx}, N_{cy}, N_{cz}]$ is the size of the coarse voxel space, and $[N_{rx}, N_{ry}, N_{rz}]$ is the size of the refined voxel space. $V_c(x_c,y_c,z_c)$ is the volume density value at the coordinate $(x_c,y_c,z_c)$, and $V_r(x_r,y_r,z_r)$ is the volume density and RGB value at the coordinate $(x_r,y_r,z_r)$.
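A minimal sketch of how the trained MLPs could be quantified into the two voxel grids of Eq. (6) is given below. The helper name bake_voxel_grids, the chunked evaluation, and the assumption that each MLP takes world coordinates directly and returns its outputs with the density in the last channel are illustrative; the positional encoding and exact network interface belong to the training code, which is not shown here.

```python
import torch

@torch.no_grad()
def bake_voxel_grids(coarse_mlp, cutoff_mlp, origin, coarse_res, refine_res,
                     coarse_cell, refine_cell, chunk=65536, device="cuda"):
    """Sketch of the coarse-to-fine voxel quantization (Eq. (6)).

    origin:      (X_d, Y_d, Z_d), starting corner of the voxel space (tuple of floats).
    coarse_res:  (N_cx, N_cy, N_cz) vertex counts of the coarse grid.
    refine_res:  (N_rx, N_ry, N_rz) vertex counts of the refined grid.
    coarse_cell / refine_cell: per-axis voxel edge lengths (l^c, l^r).
    """
    def grid_coords(res, cell):
        # Vertex coordinates X_d + l_x * i, etc., flattened to (N, 3).
        axes = [origin[d] + cell[d] * torch.arange(res[d], device=device)
                for d in range(3)]
        xs, ys, zs = torch.meshgrid(*axes, indexing="ij")
        return torch.stack([xs, ys, zs], dim=-1).reshape(-1, 3)

    # Coarse grid stores only the volume density sigma, which saves memory
    # (assumes sigma is the last output channel of the coarse MLP).
    pts_c = grid_coords(coarse_res, coarse_cell)
    sigma_c = torch.cat([coarse_mlp(p)[..., -1:] for p in pts_c.split(chunk)])
    V_c = sigma_c.reshape(*coarse_res)

    # Refined grid stores RGB and sigma predicted by the cutoff MLP.
    pts_r = grid_coords(refine_res, refine_cell)
    rgbsigma = torch.cat([cutoff_mlp(p) for p in pts_r.split(chunk)])
    V_r = rgbsigma.reshape(*refine_res, 4)
    return V_c, V_r
```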

Fig. 4. The schematic diagram of the proposed 3D voxel space construction process. In world coordinates, a coarse-to-fine 3D voxel space is built. Each vertex in the coarse voxel space stores the volume density in the coarse network. Additionally, the refined voxel space stores the volume density and RGB in the cutoff network.

The schematic diagram of the proposed coarse-to-fine 3D voxel rendering method is shown in Fig. 5. Here, the coarse voxel space only stores the volume density to reduce video memory usage and improve rendering speed. When a sampling point is located within a voxel block, trilinear interpolation is used to calculate the volume density at the sampling point, as the following expressions:

$$\begin{aligned} &\sigma_{0} = \sigma_{00}(1-y_d) + \sigma_{10}y_d\\ &\sigma_{1} = \sigma_{01}(1-y_d) + \sigma_{11}y_d\\ &V_c(x , y, z) = \sigma_{0}(1-z_d) + \sigma_{1}z_d \end{aligned}$$
where $(\sigma _{00}, \sigma _{10}, \sigma _{01}, \sigma _{11})$ are the interpolation results of the voxel vertices along the X-axis, $(x_d, y_d, z_d)$ are the fractional interpolation coordinates along the $x$, $y$, and $z$ axes, $(x,y,z)$ is the coordinate of the sampling point, and $V_c$ stores the volume density values. Then, according to the volume density values of all sampling points on the ray, the cutoff sampling method determines the positions of the refined 3D voxel sample points. Here, the cutoff sampling method is the same as the sampling method in cutoff-NeRF.
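The trilinear lookup of Eq. (7) can be sketched as follows. The function name trilinear_sigma, the boundary clamping, and the tensor layout are illustrative assumptions; the grid layout matches the baking sketch above.

```python
import torch

def trilinear_sigma(V_c, origin, cell, pts):
    """Sketch of the trilinear density lookup in the coarse voxel grid (Eq. (7)).

    V_c:    (Nx, Ny, Nz) tensor of densities stored at voxel vertices.
    origin: (3,) tensor, starting corner (X_d, Y_d, Z_d) of the voxel space.
    cell:   (3,) tensor, voxel edge lengths.
    pts:    (n, 3) world-space sample points assumed to lie inside the grid.
    """
    rel = (pts - origin) / cell                       # continuous voxel coordinates
    i0 = rel.floor().long()
    max_idx = torch.tensor([s - 2 for s in V_c.shape], device=pts.device)
    i0 = torch.minimum(torch.maximum(i0, torch.zeros_like(i0)), max_idx)
    d = (rel - i0.float()).clamp(0.0, 1.0)            # (x_d, y_d, z_d)
    xd, yd, zd = d[:, 0], d[:, 1], d[:, 2]

    def v(dx, dy, dz):  # gather one of the eight surrounding vertex densities
        return V_c[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]

    # Interpolate along x, then y (first two lines of Eq. (7)), then z.
    s00 = v(0, 0, 0) * (1 - xd) + v(1, 0, 0) * xd
    s10 = v(0, 1, 0) * (1 - xd) + v(1, 1, 0) * xd
    s01 = v(0, 0, 1) * (1 - xd) + v(1, 0, 1) * xd
    s11 = v(0, 1, 1) * (1 - xd) + v(1, 1, 1) * xd
    s0 = s00 * (1 - yd) + s10 * yd
    s1 = s01 * (1 - yd) + s11 * yd
    return s0 * (1 - zd) + s1 * zd
```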

Fig. 5. The schematic diagram of the proposed coarse-to-fine 3D voxel rendering method.

In the refined voxel rendering stage, the voxel space needs to provide the color and volume density value of each sampling point, so the $RGB\alpha$ at the sampling point needs to be calculated by interpolating the vertices of each voxel block. Finally, the ray casting integral method is used to synthesize novel virtual views, as shown in Eq. (4).
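For reference, a minimal sketch of the ray-casting integral of Eq. (4), applied to the colors and densities interpolated from the refined voxel grid, is shown below; the tensor shapes and function name are assumptions for illustration.

```python
import torch

def composite(rgb, sigma, t_dists):
    """Sketch of the ray-casting integral of Eq. (4).

    rgb:     (n_rays, n_samples, 3) colors interpolated from the refined voxel grid.
    sigma:   (n_rays, n_samples) densities at the same samples.
    t_dists: (n_rays, n_samples) distances between adjacent samples.
    """
    alpha = 1.0 - torch.exp(-sigma * t_dists)
    # Accumulated transmittance T_i: product of (1 - alpha) over preceding samples.
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = trans * alpha                           # w_i in Eq. (4)
    return (weights[..., None] * rgb).sum(dim=-2)     # (n_rays, 3) pixel colors
```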

2.3 3D voxel-based off-axis pixel encoding

According to the approach introduced in Sec. 2.2, novel views are synthesized by the 3D voxel method. Here, an off-axis pixel encoding method in the 3D voxel space is proposed to compute the element image array for the 3D light field display. 3D image synthesis methods based on virtual viewpoints or camera-array acquisition usually require a set of virtual multi-view images to be generated and then encoded for the 3D light field display, as shown in Fig. 6(a). These methods significantly reduce the 3D image synthesis speed and limit the application of view synthesis in the 3D light field display.

Fig. 6. The diagram of off-axis pixel encoding. (a) 3D image without off-axis pixel encoding. (b) 3D image with off-axis pixel encoding.

The 3D voxel-based off-axis pixel coding method can quickly synthesize 3D images and effectively improve the display efficiency, as shown in Fig. 6(b). Here, we create a virtual camera whose image resolution is the same as the display resolution for generating 3D images for the 3D light field display. Assuming that the resolution of the 3D image is $(W, H)$ and the viewing angle of the virtual camera is $fov$, the following expression represents the virtual camera:

$$\left\{ \begin{aligned} W & = O_{lookat} - O_{eye} \\ V & = V_d \times length(W) \times tanf( \frac{fov}{2} \times \frac{PI}{180})\\ U & = U_d \times W / H \times tanf( \frac{fov}{2} \times \frac{PI}{180}) \\ \end{aligned} \right\}$$
where $O_{eye}$ is the starting point of the ray, and $O_{lookat}$ is the center point of the scene. $V_d$ and $U_d$ are the direction vectors in the local coordinate system of the camera. U, V, and W are the virtual camera parameters in the camera local coordinate system. Then, the line number $L_n$ and inclination $\theta$ of the light field display are used to calculate the virtual camera label $C_{id}$ of each sub-pixel, as the following expression:
$$C_{id} = \dfrac{(3 * c_{i=1}^W + 3 * r_{j=1}^H * \theta + k_{1,2,3} - \lfloor \dfrac{3 * c_{i=1}^W + 3 * r_{j=1}^H * \theta + k_{1,2,3} }{L_n} \rfloor) \times N_{view}}{L_n}$$
where $(c,r,k)$ are the column, row, and channel indices of the 3D image, and $N_{view}$ is the number of viewpoints. Then, the position $P_{ori}$ and direction $P_{dir}$ of the ray corresponding to each sub-pixel of the virtual camera are expressed as:
$$\begin{array}{r} P_{ori} = O_{eye} + Move_{xyz} + Zero_{depth} \times normalize(U) \\ P_{dir} = normalize(u \times V + v \times U + W - Move_{xyz}) \end{array}$$
where $Move_{xyz}$ is the distance offset from the virtual center camera, $Zero_{depth}$ is the zero-plane depth, and $(u,v)$ are the pixel coordinates with the width and height normalized to [−1, 1]. Finally, the ray casting algorithm is used in the 3D voxel space to rapidly synthesize the 3D image.
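As an illustration, the per-sub-pixel viewpoint labels of Eq. (9) can be computed in a vectorized way as sketched below, interpreting the bracketed term as the phase modulo $L_n$. The function name and parameter layout are assumptions, and the display parameters ($L_n$, $\theta$, $N_{view}$) are device-specific.

```python
import numpy as np

def subpixel_view_ids(width, height, line_number, theta, n_view):
    """Sketch of the per-sub-pixel viewpoint label of Eq. (9).

    width, height: resolution of the 3D encoded image.
    line_number:   L_n, the line number of the light-field display.
    theta:         the lens inclination.
    n_view:        number of viewpoints provided by the display.
    Returns a (height, width, 3) array with the camera label per RGB sub-pixel.
    """
    c = np.arange(width)[None, :, None]    # column index
    r = np.arange(height)[:, None, None]   # row index
    k = np.arange(3)[None, None, :]        # sub-pixel channel (R, G, B)
    phase = 3 * c + 3 * r * theta + k
    # Phase modulo L_n, rescaled to the viewpoint range [0, n_view).
    return (phase - np.floor(phase / line_number) * line_number) * n_view / line_number
```

In a practical encoder, each label would be rounded to select the off-axis camera offset $Move_{xyz}$, after which Eq. (10) gives the ray origin and direction that are cast directly through the refined voxel space.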

3. Implementation and analysis

In this section, we first introduce the implementation of the proposed algorithm and the computer configuration. Then, we introduce the 3D light field equipment used in this experiment to display the 3D images. In addition, some parameters of the proposed algorithm are analyzed to improve the quality of the synthesized views.

3.1 Method implementation

Three sparse viewpoint datasets are used in this paper, including multi-view forward-facing scenes (real forward-facing), multi-view $360^{\circ }$ scenes (realistic synthetic $360^{\circ }$), and our self-built sparse view scenes. The COLMAP algorithm is used to calibrate the sparse views to obtain the intrinsic and extrinsic parameters of the cameras. $15\%$ of the data in each scene is used to evaluate the quality of the synthesized views, and $85\%$ is used for training the cutoff-NeRF network model. Each view is ray-sampled within the camera frustum, and the number of ray samples for both the coarse and cutoff networks is set to 64. Then two voxel spaces are built, the coarse voxel space of $600\times 600\times 600$ and the refined voxel space of $800\times 800\times 800$, to quantify the scene. The neural network and the voxel rendering and encoding algorithm are implemented in PyTorch and run on an NVIDIA GeForce RTX 3090 Ti GPU. Training takes about six hours per scene. The proposed algorithm takes only 10 ms to synthesize a $1920 \times 1080$ resolution 3D image, 2.19 s to synthesize a $3840 \times 2160$ resolution 3D image, and 14.41 s to synthesize a $7680 \times 4320$ resolution 3D image.

The 3D images synthesized by this algorithm can be used for various 3D light field displays. In this paper, an innovative 3D light field display is used to display the 3D images, which supports 8K ($7680 \times 4320$) resolution image display, as shown in Fig. 7. The device is 166 cm wide and 93 cm high and can provide 96 viewpoints within an $80^{\circ }$ viewing angle. At the same time, this device can provide smooth motion parallax at a 200 cm viewing distance. The proposed virtual view synthesis method can provide high-quality 3D encoded images for the 3D light field display and effectively solve the problem of dense viewpoint collection.

Fig. 7. The 3D light-field display used to display 3D images in this paper, which supports 96 views within an $80^{\circ }$ viewing angle.

3.2 Analysis

In 3D voxel-based view synthesis, there are two main spatial sampling point interpolation methods: trilinear interpolation and nearest-neighbor interpolation. Here, the structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and rendering time are used to evaluate the quality and speed of the two interpolation algorithms on synthesized views. Larger values of SSIM and PSNR signify higher quality of the synthesized view. The flower scene in the real forward-facing dataset is used for this analysis by synthesizing $1008\times 756$ resolution virtual multi-views. The coarse voxel space is set to $600\times 600\times 600$, and the refined voxel space is set to $800\times 800\times 800$. As shown in Fig. 8, the metrics at different viewpoint positions are calculated and plotted. The PSNR and SSIM of the virtual views synthesized by trilinear interpolation are higher than those of nearest-neighbor interpolation, so the synthesized view quality of trilinear interpolation is higher. The reason is that trilinear interpolation uses the neighboring vertex values around the sampling point, which avoids the discontinuities between pixels produced by nearest-neighbor interpolation. However, the computation of trilinear interpolation is larger than that of nearest-neighbor interpolation, so we measured the rendering time taken by the two methods to synthesize a virtual view under different sampling numbers. The rendering times of the two methods are both close to 0.1 ms because they run on the GPU. In the following experiments, the interpolation method is set to trilinear interpolation.
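For reproducibility, the per-view metrics can be computed as in the following sketch, assuming a recent scikit-image (the channel_axis argument); this is an illustrative evaluation helper, not the authors' code.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_view(pred, gt):
    """Sketch of the per-view PSNR/SSIM evaluation used in this section.

    pred, gt: (H, W, 3) uint8 arrays of the synthesized and ground-truth views.
    """
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim
```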

Fig. 8. Simulations of different 3D voxel interpolation methods are tested, and the interpolation method is determined. The green curves show the nearest-neighbor interpolation method, and the orange curves show the trilinear interpolation method.

When the coarse-to-fine voxel space explicitly represents the implicit scene, the voxel space size affects the synthesized view's quality, speed, and video memory occupancy. The flower scene in the real forward-facing dataset is used to evaluate the SSIM, PSNR, and video memory occupancy of the synthesized images in different voxel spaces. Simulations of various voxel space sizes are tested, as shown in Fig. 9. From Fig. 9, we can see that as the voxel size increases, the PSNR and SSIM of the synthesized views from different viewpoints gradually increase. When the coarse voxel space and refined voxel space are set to $700\times 700\times 700$, the PSNR and SSIM of the synthesized views increase only slightly, as shown in the red area of Fig. 9. The reason is that once the coarse and refined voxel spaces reach a certain size, the content distribution of the current scene can already be well represented. Moreover, the video memory usage increases by only 5 GB. Here, to obtain high-quality virtual views while reducing video memory occupancy, the coarse voxel space is set to $600\times 600\times 600$, and the refined voxel space is set to $800\times 800\times 800$ in the following experiments.

Fig. 9. Simulations of different voxel sizes are tested, and the coarse and refined voxel sizes are determined. (a) and (b) show the PSNR and SSIM of the synthesized views from different viewpoints as the voxel size increases. (c) shows the change of video memory usage as the voxel count increases.

4. Experiments

In this section, to evaluate the performance of the presented virtual view synthesis for the 8K 3D light field display, we build a sparse viewpoint dataset based on occlusion scenes. We performed quantitative and qualitative analyses of the experimental results and evaluated the view synthesis quality and speed of our method. In the experiments, we first introduce the self-built dataset, the captured scenes, and other information. Then, we test the algorithm's performance on the three datasets, including the view synthesis quality and speed. Finally, we display the synthesized 3D images on the 3D light field display to observe the display effect.

4.1 Viewpoint synthesis dataset

This part introduces the public sparse viewpoint datasets and our own dataset for virtual view synthesis. First, the images of the self-built sparse viewpoint dataset are captured with a camera in two scenes. The camera can capture 4K resolution images, as shown in Table 1. By changing the camera angle and shooting distance during capture, diversity in the perspectives and scales of the same scene is obtained. Some images in the self-built dataset are shown in Fig. 10(a). The scenes in the dataset are complex, containing small objects, weak textures, complex occlusions, etc. We used COLMAP to calibrate the dataset for training and testing the network. In addition to this self-built dataset, we also selected 158 pictures from the public sparse-view datasets (Real Forward-Facing and Realistic Synthetic $360^{\circ }$) to supplement the training and testing sets, as shown in Fig. 10(b). The datasets contain forward-facing and ring-shooting scenes for viewpoint synthesis research.

Fig. 10. Some images of our training and testing dataset, including (a) the self-built sparse viewpoint dataset, where the images were captured with a camera, and (b) the public sparse-view datasets (Real Forward-Facing and Realistic Synthetic $360^{\circ }$).

Table 1. The camera and lens parameters

4.2 View synthesis

In this section, we chose several typical scenes from the three datasets to synthesize virtual views for comparative experiments. First, we selected the horns scene to evaluate the effectiveness of the proposed cutoff sampling method. Then, the proposed algorithm is compared with the mainstream virtual view synthesis algorithms, including the MPI-based LLFF and NeRF. SSIM and PSNR are used to evaluate the synthesized view quality, and the frame rate (FPS) is used to evaluate the view synthesis speed.

To evaluate the effectiveness of the proposed cutoff sampling method, we selected the horns scene to do ablation comparison experiments, and the experimental results are shown in Fig. 11. We compare the quality and effect of viewpoint synthesis before and after adding the cutoff method to NeRF and our method. It can be observed from Fig. 11 that the PSNR and SSIM values of NeRF with cutoff are higher than NeRF without cutoff. At the same time, the virtual view quality of our method with cutoff is higher than our method without cutoff. From the zoomed-in details of the virtual view in Fig. 11, it can be observed that the NeRF with cutoff shows more details than NeRF without cutoff. In particular, the virtual view of NeRF with cutoff clearly shows the white vertical line in the middle. The lights in the virtual view of our method with cutoff are smoother than in our method without cutoff. So, the proposed cutoff sampling method can effectively improve the virtual view quality.

Fig. 11. Ablation comparison of the proposed cutoff method. The detailed image corresponds to the red area of the virtual view.

Virtual views of four different scenes are synthesized, and image details are presented, as shown in Fig. 12. The left images are the virtual views synthesized by our method. The right images are the ground-truth images and the synthesized views of the different algorithms at the positions of the red and yellow rectangles. The figure also shows each algorithm's SSIM, PSNR, and FPS. For the outdoor weak-texture flower scene, the synthesized view FPS of the proposed algorithm is much higher than that of the other algorithms, reaching 487 FPS at $1008 \times 756$ resolution, which is almost 8000 times that of NeRF. Moreover, the quality of the virtual views synthesized by our algorithm is higher than that of the LLFF method, whose virtual views are more blurry. The quality of the proposed method is lower than that of NeRF, but the visual effect is almost the same, and the details of the flower can still be observed. For the simulation model LEGO scene, our algorithm reconstructs the content distribution well from sparse viewpoints, with a PSNR of about 29.1037 dB and an SSIM of about 0.9321. The virtual view synthesized by the LLFF method has serious missing content. The quality of the virtual views synthesized by our method is similar to NeRF, but the view synthesis speed is much faster. Overall, the proposed algorithm can quickly synthesize high-quality virtual views on the public datasets.

Fig. 12. Results for outdoor weak texture flower scene, simulation model LEGO scene, indoor complex occlusion teddy bear and camera bracket scene. The synthetic view quality of our method is similar to NeRF, and the synthesis speed is much faster than other methods.

For the self-built indoor complex-occlusion Teddy bear and Camera bracket scenes, our method is able to synthesize high-quality virtual views of small objects. In the Teddy bear scene, the PSNR of our method is lower than NeRF, but the SSIM is almost the same, and the text in the virtual view and the texture on the bear can be clearly observed. In the Camera bracket scene, we compare the viewpoint synthesis quality for small objects. The proposed method can rapidly synthesize virtual views of small objects; although its PSNR is lower than NeRF, the SSIM meets the requirements of the 3D light field display. Overall, the proposed algorithm can quickly synthesize high-quality virtual views on the self-built dataset. The view synthesis speed of the proposed method far exceeds that of the other methods because it represents the scene content with voxels and utilizes the GPU to synthesize virtual views in parallel. Here, parallel computing and viewpoint synthesis are combined to improve the viewpoint synthesis speed. Additionally, to improve the quality of the synthesized views, our method sets the voxel grid size and the interpolation method according to the simulation results.

4.3 Presenting on 3D light-field display

The virtual viewpoint generation method can generate a series of virtual views with horizontal parallax by adjusting the camera pose for the 3D light field display. Here, we compare the 3D image synthesis speed of our method with other existing methods, as shown in Table 2. The synthesis speed of our method exceeds that of the other methods because the other methods need to generate a series of views before synthesizing 3D images, whereas our method directly synthesizes 3D images with the 3D voxel-based off-axis pixel coding method. At $1920 \times 1080$ resolution, the proposed method synthesizes 3D images in milliseconds. At $3840 \times 2160$, $5120 \times 2880$, and $7680 \times 4320$ resolution, the proposed method synthesizes 3D images in seconds. The efficiency drops for higher-resolution 3D images because GPU performance decreases once the memory occupancy reaches a certain level.

Table 2. Speed of synthesizing 3D images at different resolutions using different methods

Some 3D images are presented on our innovative 8K 3D light-field display, as shown in Fig. 13. The 3D images of the flower and horns scenes are used for the 3D light field display. On the display, we can observe the details in the scenes, correct occlusion, and smooth motion parallax. In addition, the epipolar plane images (EPIs) are computed to demonstrate the smooth and clear parallax across the synthesized views, as shown in Fig. 14. Experimental results indicate that the proposed method can rapidly synthesize high-quality 3D encoded images for the 3D light field display, which solves the problem of dense viewpoint collection in the 3D light field display.

Fig. 13. The results of 3D light-field display (see Visualization 1).

Fig. 14. Horizontal EPIs of different scenes.

5. Conclusion

In summary, a two-stage method based on cutoff-NeRF and 3D voxel rendering is proposed to synthesize virtual views for the 8K 3D light field display. In the first stage, the cutoff-NeRF method is proposed to improve the quality of synthesized views. In the second stage, a 3D voxel-based rendering and coding algorithm is proposed to quantify cutoff-NeRF and quickly synthesize high-resolution, high-quality 3D images. A coarse-to-fine voxel rendering method is proposed to improve the quality of the synthesized views and the rendering speed, and a 3D voxel-based off-axis pixel encoding method is proposed to speed up 3D image synthesis. Additionally, a sparse-view dataset is built to analyze the effectiveness of the proposed method. The experimental results demonstrate the effectiveness of the proposed method: the PSNR of the virtual views is about 29.75 dB, the SSIM is about 0.88, and synthesizing an 8K 3D image takes about 14.41 s. Our method can quickly synthesize high-quality 3D images of complex scenes for 3D light field displays. In the future, we believe our method will be widely applied to content synthesis for 3D light field displays.

Funding

National Key Research and Development Program of China (2021YFB2802300); National Natural Science Foundation of China (62175017, 62075016).

Acknowledgments

LEGO® is a trademark of the LEGO Group of companies which does not sponsor, authorize or endorse this material.

Disclosures

The authors declare no conflicts of interest. This work is original and has not been published elsewhere.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. X. Sang, X. Gao, X. Yu, S. Xing, Y. Li, and Y. Wu, “Interactive floating full-parallax digital three-dimensional light-field display based on wavefront recomposing,” Opt. Express 26(7), 8883–8889 (2018). [CrossRef]  

2. N. Balram and I. Tošić, “Light-field imaging and display systems,” Inf. Disp. 32(4), 6–13 (2016). [CrossRef]  

3. X. Gao, X. Yu, X. Sang, L. Liu, and B. Yan, “Improvement of a floating 3d light field display based on a telecentric retroreflector and an optimized 3d image source,” Opt. Express 29(24), 40125–40145 (2021). [CrossRef]  

4. R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, “Light field photography with a hand-held plenoptic camera,” Ph.D. thesis, Stanford University (2005).

5. D. Liu, X. Huang, W. Zhan, L. Ai, X. Zheng, and S. Cheng, “View synthesis-based light field image compression using a generative adversarial network,” Inf. Sci. 545, 118–131 (2021). [CrossRef]  

6. B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, “High performance imaging using large camera arrays,” ACM Trans. Graph. 24(3), 765–776 (2005). [CrossRef]  

7. H. Kawasaki, K. Ikeuchi, and A. Sakauchi, “Light field rendering for large-scale scenes,” in Computer Vision and Pattern Recognition, IEEE, vol. 2 (2001), p. II.

8. G. Chaurasia, O. Sorkine, and G. Drettakis, “Silhouette-aware warping for image-based rendering,” in Computer Graphics Forum, vol. 30 (Wiley Online Library, 2011), pp. 1223–1232.

9. L. Ballan, G. J. Brostow, J. Puwein, and M. Pollefeys, “Unstructured video-based rendering: Interactive exploration of casually captured videos,” in ACM SIGGRAPH, (2010), pp. 1–11.

10. Y. Ming, X. Meng, C. Fan, and H. Yu, “Deep learning for monocular depth estimation: A review,” Neurocomputing 438, 14–33 (2021). [CrossRef]  

11. A. Sagar, “Monocular depth estimation using multi scale neural network and feature fusion,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (2022), pp. 656–662.

12. K. Fu, J. Peng, Q. He, and H. Zhang, “Single image 3d object reconstruction based on deep learning: A review,” Multimed. Tools Appl. 80(1), 463–498 (2021). [CrossRef]  

13. J. Chibane, A. Bansal, V. Lazova, and G. Pons-Moll, “Stereo radiance fields (srf): Learning view synthesis for sparse views of novel scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 7911–7920.

14. O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson, “Synsin: End-to-end view synthesis from a single image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 7467–7477.

15. B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, “Local light field fusion: Practical view synthesis with prescriptive sampling guidelines,” ACM Trans. Graph. 38(4), 1–14 (2019). [CrossRef]  

16. J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. Overbeck, N. Snavely, and R. Tucker, “Deepview: View synthesis with learned gradient descent,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 2367–2376.

17. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European Conference on Computer Vision, (2020), pp. 405–421.

18. J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 5855–5864.

19. A. Yu, V. Ye, M. Tancik, and A. Kanazawa, “pixelnerf: Neural radiance fields from one or few images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 4578–4587.

20. T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: Learning view synthesis using multiplane images,” ACM Trans. SIGGRAPH 37(4), 1–12 (2018). [CrossRef]  

21. P. P. Srinivasan, R. Tucker, J. T. Barron, R. Ramamoorthi, R. Ng, and N. Snavely, “Pushing the boundaries of view extrapolation with multiplane images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 175–184.

22. D. Chen, X. Sang, P. Wang, X. Yu, X. Gao, B. Yan, H. Wang, S. Qi, and X. Ye, “Virtual view synthesis for 3d light-field display based on scene tower blending,” Opt. Express 29(5), 7866–7884 (2021). [CrossRef]  

23. D. Chen, X. Sang, P. Wang, X. Yu, B. Yan, H. Wang, M. Ning, S. Qi, and X. Ye, “Dense-view synthesis for three-dimensional light-field display based on unsupervised learning,” Opt. Express 27(17), 24624–24641 (2019). [CrossRef]  

24. H. Wang, B. Yan, X. Sang, D. Chen, P. Wang, S. Qi, X. Ye, and X. Guo, “Dense view synthesis for three-dimensional light-field displays based on position-guiding convolutional neural network,” Opt. Lasers Eng. 153, 106992 (2022). [CrossRef]  

25. J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 4104–4113.

Supplementary Material (1)

Visualization 1: The results of 3D light-field display.
