Real-time light-field generation based on the visual hull for the 3D light-field display with free-viewpoint texture mapping


Abstract

Real-time dense view synthesis based on three-dimensional (3D) reconstruction of real scenes remains a challenge for the 3D light-field display. It is time-consuming to reconstruct an entire model and then synthesize the target views afterward by volume rendering. To address this issue, the Light-field Visual Hull (LVH) is presented with free-viewpoint texture mapping for the 3D light-field display, which can directly produce synthetic images from the 3D reconstruction of real scenes in real time based on forty free-viewpoint RGB cameras. An end-to-end subpixel calculation procedure for the synthetic image is demonstrated, which defines a rendering ray for each subpixel based on light-field image coding. During ray propagation, only the essential spatial point of the target model is located for the corresponding subpixel by projecting the frontmost point of the ray into all the free viewpoints, and the color of each subpixel is determined in one pass. A dynamic free-viewpoint texture mapping method is proposed to resolve the correct texture with respect to the free-viewpoint cameras. To improve efficiency, only the visible 3D positions and textures that contribute to the synthetic image are calculated based on backward ray tracing, rather than computing the entire 3D model and generating all elemental images. In addition, an incremental calibration method based on dividing the cameras into groups is proposed to satisfy the accuracy requirement. Experimental results show the validity of our method. All the rendered views are analyzed to justify the texture mapping method, and the PSNR is improved by an average of 11.88 dB. Finally, LVH achieves a natural and smooth viewing effect at 4K resolution and a frame rate of 25∼30 fps with a large viewing angle.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In recent years, three-dimensional (3D) light-field displays, which can present a large viewing angle and dense viewpoints at high resolution, have attracted much attention [1–4]. The 3D light-field display can provide detailed 3D scenes with proper occlusion when observers view the 3D world. It is an advanced display method with great potential in education, VR, and medical imaging. In general, the approaches to recovering 3D light-field content include rendering-based methods [5–10], dense view synthesis [11–24], and 3D reconstruction [25–31]. However, presenting a plausible 3D model of real scenes in real time is still a challenge.

The rendering-based methods [5–10] adopt programmable graphics pipelines to render a constructed 3D model for the 3D light-field display. The previous study of multiple-viewpoint rendering (MVR) [5] can produce multiple perspectives of artificial models at high speed. A highly efficient computer-generated integral imaging (CGII) method [6] accomplishes this task based on the backward ray-tracing technique [7,8]. The methods MDIBR [9] and PMR [10] achieve real-time rendering by directly calculating the final 3D image. These methods render at a higher resolution and frame rate, but the target models only represent virtual models acquired before the rendering task rather than realistic scenes. In addition, displaying dynamic scenes with the above methods requires a separate preprocessed 3D model for each frame, such as a moving human body, which introduces a large amount of manual work.

The view synthesis approaches [11–24] focus on generating dense virtual viewpoints between the captured sparse viewpoints for the 3D light-field display. Chan et al. proposed the Image-Based Rendering (IBR) [11] method to generate a synthesized image by warping the captured viewpoint to the desired viewing position without 3D reconstruction. A fast view synthesis method [12] achieves multi-view synthesis on the GPU, but an accurate depth map is hard to acquire. In recent years, convolutional neural networks (CNNs) have been applied to view synthesis tasks due to their powerful feature extraction abilities. MPI methods [13–15] can synthesize novel views of static scenes by accumulating an RGB image and an alpha map at each fixed depth plane. The Multi-Parallax View Net (MPVN) [16] used an unsupervised learning method that synthesizes high-quality virtual views with a color tower and a selection tower, optimizing the output image by per-pixel weighted summing. The restriction on camera distribution was removed in [17], which can handle randomly distributed viewpoints. A virtual view synthesis method based on scene tower blending [18] was proposed to improve the quality and consistency of the generated views. View selection strategies for the input images were demonstrated to simultaneously improve the quality and efficiency of the virtual views [19]. However, the proper viewing distance is restricted to a small area determined by the distribution of the RGB camera array. Other CNN-based methods, such as NeRF [20], pixelNeRF [21], NeRF++ [22], Nerfies [23], and cross-view image synthesis [24], can synthesize novel views with high quality. For instance, pixelNeRF [21] applies a rendering pipeline that is independent of the above methods, which means rendering pipelines can be used in the light-field image coding process. However, the CNN-based methods require a training process for each scene, which prevents real-time performance for captured dynamic scenes.

The 3D reconstruction algorithms [25–31] use multiple viewpoints to recover the geometric structure of natural, realistic scenes for the 3D light-field display. Various proper viewing distances can be obtained in the rendering process by adjusting the position of the virtual camera array, but the frame rate remains essential for these methods. Previous studies of the voxel visual hull [25,26], which use multi-view silhouettes to reconstruct the target 3D model, were aimed at small scenes with a close capturing distance. The CUDA toolkit was introduced to the visual hull [27–29] to realize real-time performance. Building on this research, the GPU-based visual hull algorithm (GIBVH) [29] can reconstruct a voxel model from uniformly distributed views on a platform in a small scene. However, it is still difficult to achieve multi-camera calibration of real scenes with randomly distributed cameras in practical applications. Deep learning-based methods, such as DoubleField [30] and DeepMultiCap [31], can generate implicit 3D models and render single-view dynamic images by evaluating a trained neural network. However, the images generated with such methods do not reach a sufficient frame rate to provide a large viewing angle with dense views.

Here, the Light-field Visual Hull (LVH) is proposed to realize continuous real-time 3D light-field display of dynamic real scenes. The synthetic image can be generated directly by computing only the essential 3D points and textures instead of producing all synthesized views or the entire 3D model. Every subpixel of the synthetic image corresponds to a specific rendering ray launched by a virtual camera, and the virtual camera array is arranged based on light-field image coding with an adjustable viewing distance. Based on the ray-casting technique, each emitted ray marches step by step to search for the 3D intersection, which is a surface point of the model, during the carving process. Specifically, the 3D frontmost point of the marching ray is projected into the captured views at every step to determine whether it belongs to the model, and once the 3D intersection is found, the color of each subpixel is derived from the projected intersection in the selected views. CUDA is used to accelerate the heavy calculation. A free-viewpoint texture mapping method is presented to obtain higher quality in each virtual viewpoint. Besides, a multi-camera calibration method is proposed using synchronized frames based on an incremental calibration strategy. In our experiment, an RGB camera array containing forty Blackmagic cameras is built to acquire synchronized multi-view videos with a resolution of 6480$\times$4608 and a frame rate of 25 fps.

2. Proposed method

The overall concept of our method is shown in Fig. 1, which includes light-field capture with data preprocessing and real-time 3D light-field imaging. The RGB cameras are randomly distributed with different poses, and the focal plane of each camera is aligned to the center of the 3D space. All the RGB cameras are calibrated with the proposed multi-view calibration algorithm. For each frame of the multi-view video, the camera undistortion, color calibration, and human segmentation tasks are performed on the server during data preprocessing. The target model is segmented for LVH based on Robust High-Resolution Video Matting (RVM) [34]. As shown in Fig. 1(a), the synchronized multi-view videos are captured and recorded with Open Broadcaster Software (OBS) [33]. The real-time 3D light-field imaging process is shown in Fig. 1(b). A virtual camera array, in which the virtual cameras are uniformly distributed in an off-axis manner, is arranged to reconstruct and render only the needed 3D model points and subpixels for the 3D light-field display. The location and direction of each virtual camera are set based on the multi-view image encoding. Owing to the 3D point reconstruction process in LVH, the proper viewing distance can be easily achieved by adjusting the virtual camera array. In short, LVH can generate synthetic images for the 3D light-field display in real time.


Fig. 1. The offline light-field capture and the real-time 3D light-field imaging process. (a) The captured multiple viewpoints with free-viewpoint RGB cameras are shown in the red rectangle area. (b) The virtual viewpoints rendered with the virtual cameras are shown in the blue one. In practice, the synthetic image for 3D imaging is directly generated instead of rendering each specific viewpoint.


2.1 Multi-view calibration

In the RGB camera array, each camera has different intrinsic, extrinsic, and distortion parameters. The intrinsic parameters of all the cameras are calibrated together with the radial and tangential distortion [32]. Here, a multi-frame extrinsic calibration method based on dividing the cameras into groups is proposed to meet the accuracy requirement of 3D reconstruction.

First of all, the initial extrinsic parameters of each camera are estimated based on Perspective-n-Point (PnP) [35]. It is ensured that the front of the checkerboard can be viewed by all the cameras, so accurate chessboard corners can be detected. However, the distribution of the spatial points is limited to a narrow range, so global optimization is not performed at this step. As shown in Fig. 2(a), forty synchronized images corresponding to the forty real cameras are recorded. The 3D locations of the inner corners are set manually according to the real size of the calibration board, and the 2D coordinates are detected with chessboard corner detection. Since the correspondence between the 3D and 2D feature points is known, the initial camera poses can be solved with the PnP method.
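For illustration, a minimal sketch of this PnP initialization using OpenCV is given below. The board size, square size, and image paths are assumptions for the example, not values reported in the paper.

```python
# A minimal sketch of the PnP-based pose initialization described above.
import cv2
import numpy as np

BOARD_SIZE = (11, 8)   # inner corners of the checkerboard (assumed)
SQUARE_SIZE = 0.06     # physical square size in meters (assumed)

# 3D corner locations on the board plane (z = 0), matching Fig. 2(a)
obj_pts = np.zeros((BOARD_SIZE[0] * BOARD_SIZE[1], 3), np.float32)
obj_pts[:, :2] = np.mgrid[0:BOARD_SIZE[0], 0:BOARD_SIZE[1]].T.reshape(-1, 2) * SQUARE_SIZE

def initial_pose(image_path, K, dist):
    """Estimate the initial extrinsic (R, T) of one camera from its
    synchronized checkerboard image, as in the PnP step of Section 2.1."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(img, BOARD_SIZE)
    if not found:
        raise RuntimeError("checkerboard not detected in " + image_path)
    # Refine the detected corners to subpixel accuracy
    corners = cv2.cornerSubPix(
        img, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    ok, rvec, tvec = cv2.solvePnP(obj_pts, corners, K, dist)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec     # initial pose of this camera
```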


Fig. 2. The proposed incremental camera calibration method. (a) Captured images for calibration from $V_{1}$ to $V_{40}$, in which the board can be viewed by all cameras. The origin of the 3D world coordinate ${O}$, the $x$-axis, and the $y$-axis are set to the horizontal and vertical directions of the calibration board, respectively. (b) The multi-frame calibration method based on camera groups. The blue curve represents the camera groups $G_{1}$ and $G_{2}$, which are calibrated first. Next, the green curve indicates that the camera group $G_{3}$ is calibrated with the parameters of $G_{2}$ fixed. Finally, the red curve denotes the overall optimization of $G_{1}$, $G_{2}$, and $G_{3}$.


Next, the poses are incrementally optimized by Bundle Adjustment (BA) [36]. Every four cameras are divided into a group, namely $G_{1}$, $G_{2}$, $\cdots$, $G_{10}$. The purpose of dividing the cameras into groups is to enlarge the distribution of the observed 3D points within each group so that locally optimal solutions can be avoided. Multiple synchronized frames are recorded between every two contiguous groups, such as the cameras in $G_{1}$ and $G_{2}$, and in $G_{2}$ and $G_{3}$, respectively. Compared with the previous forty synchronized images, in which each camera records only one image, the multiple synchronized frames contain seventy chessboard images per camera. The checkerboard is progressively moved through various depth planes during the acquisition process, and random rotations and translations are applied to introduce more 3D constraints. For the first two camera groups $G_{1}$ and $G_{2}$, as shown in Fig. 2(b), triangulation is performed to reconstruct 3D points from the matched 2D feature points in the multiple frames. BA is then adopted to minimize the reprojection error, and the camera poses in $G_{1}$ and $G_{2}$ are refined. For the next pair of camera groups, $G_{2}$ and $G_{3}$, the cameras in $G_{3}$ are calibrated with the poses of $G_{2}$ fixed. The other cameras in $G_{4}$, $G_{5}$, $\cdots$, $G_{10}$ are incrementally calibrated by the same operation. During the incremental optimization, the extrinsic parameters of each camera and all the triangulated 3D points are optimized with a Huber loss, while the intrinsic parameters of each camera are held constant. The resulting camera poses are calibrated more precisely than the initial estimates.
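The group-wise refinement step can be sketched as follows. This is a simplified illustration, not the authors' implementation: only the poses of the newly added group and the triangulated 3D points are optimized with a Huber loss via SciPy, the previous group and all intrinsics stay fixed, and the observation arrays (`cam_idx`, `pt_idx`, `uv_obs`) as well as the initial values are assumed inputs.

```python
# Simplified sketch of one incremental BA step (Section 2.1, group-wise refinement).
import numpy as np
import cv2
from scipy.optimize import least_squares

def reproj_residuals(x, n_new, K_list, poses_fixed, cam_idx, pt_idx, uv_obs):
    """x packs 6 pose parameters (rvec|tvec) per new camera followed by the 3D points.
    Cameras are indexed with the fixed (previous) group first, the new group after."""
    poses_new = x[:n_new * 6].reshape(n_new, 6)
    pts3d = x[n_new * 6:].reshape(-1, 3)
    poses = np.vstack([poses_fixed, poses_new])
    res = []
    for c, p, uv in zip(cam_idx, pt_idx, uv_obs):
        proj, _ = cv2.projectPoints(pts3d[p][None], poses[c, :3], poses[c, 3:],
                                    K_list[c], None)   # intrinsics are held constant
        res.append(proj.ravel() - uv)                  # 2D reprojection error
    return np.concatenate(res)

def refine_new_group(poses_fixed, poses_new0, pts0, K_list, cam_idx, pt_idx, uv_obs):
    """Refine the poses of the new group and the triangulated points with a Huber loss."""
    x0 = np.concatenate([poses_new0.ravel(), pts0.ravel()])
    sol = least_squares(reproj_residuals, x0, loss="huber", f_scale=1.0,
                        args=(len(poses_new0), K_list, poses_fixed,
                              cam_idx, pt_idx, uv_obs))
    n = len(poses_new0)
    return sol.x[:n * 6].reshape(n, 6), sol.x[n * 6:].reshape(-1, 3)
```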

2.2 Color calibration

The color calibration process adjusts the color response of each camera to a standard color space, since the color of a given point is inconsistent across viewpoints due to the lighting conditions and the differences between cameras. A standard color calibration board, which contains twenty-four common colors in a specified RGB color space, is used as the calibration target. As shown in Fig. 3(a), the Macbeth color checker is captured in all viewpoints [37]. A smaller chessboard is also placed to locate each color block more precisely. All viewpoints are warped so that the color blocks can be fetched conveniently. The average colors of the twenty-four blocks are calculated in each viewpoint and compared with the standard calibration target based on the Color Correction Model (CCM) [38]. The color correction result is shown in Fig. 3(b). In our experiment, each camera is independently calibrated with a color correction matrix that fits the standard calibration target.
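A minimal sketch of how such a per-camera color correction matrix could be fitted and applied with linear least squares is given below; the patch extraction step and the reference chart values are assumed to be available, and the affine (3×4) form is one common choice rather than the paper's exact model.

```python
# Per-camera color correction: map the 24 measured patch colors to the reference chart.
import numpy as np

def fit_ccm(measured, reference):
    """measured, reference: (24, 3) arrays of patch colors in [0, 1].
    Returns a 3x4 affine color correction matrix (with an offset column)."""
    A = np.hstack([measured, np.ones((measured.shape[0], 1))])  # add bias term
    M, *_ = np.linalg.lstsq(A, reference, rcond=None)           # (4, 3) solution
    return M.T                                                  # (3, 4)

def apply_ccm(image, M):
    """Apply the correction to an HxWx3 float image of one camera."""
    h, w, _ = image.shape
    rgb1 = np.concatenate([image.reshape(-1, 3), np.ones((h * w, 1))], axis=1)
    return (rgb1 @ M.T).reshape(h, w, 3).clip(0.0, 1.0)
```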


Fig. 3. The color calibration process in our experiment. (a) The squares in the color calibration board are detected in each viewpoint. (b) The color calibration loss in each view where higher loss means larger calibration error.


2.3 Overall approach of the light-field visual hull

In general, the color of each subpixel in the synthetic image is calculated in one pass through rendering ray definition, ray propagation, and color generation for the 3D light-field display. To improve efficiency, backward ray tracing is adopted in LVH. In this way, only the essential 3D points and colors that contribute to the synthetic image are computed, so that real-time operation is guaranteed. Taking a non-rigid 3D human body as an example, the overall approach of LVH is shown in Fig. 4. Firstly, the virtual camera array is arranged based on off-axis light-field image coding. The location of each spatial point on the model is solved in parallel during ray propagation. Next, the texture of each model point that can be observed by the virtual camera is calculated. Finally, all the blended textures are mapped to the corresponding subpixels, and the synthetic image is generated. In contrast, if a forward processing method were applied to this task, the reconstruction of an entire model, the rendering of all elemental images, and the generation of the synthetic image would have to be performed in sequence, which costs a large amount of computing time.


Fig. 4. The overall approach of LVH. The red arrows and the blue arrows represent the ray propagation and variable calculation process, respectively. (a) For each subpixel, its corresponding viewpoint number $m(i,j,k)$, $m$-th virtual camera position $P_{v}^{m}$, direction $ \overrightarrow {D_{dir}}$, and rendering ray $ \overrightarrow {R_{ray}}$ are defined. (b) The propagation process of $ \overrightarrow {R_{ray}}$. To locate the intersection point $P_{0}$ of the human body, the frontmost point $P_{w}$ of the ray is propagated forward with 3D point perspective projection. The parameters of the free-viewpoint cameras are calibrated as the method in Section 2.1. (c) The texture of $P_{0}$ is calculated by free-viewpoint texture mapping, so the color of the target subpixel is generated for the 3D light-field display.


2.4 Rendering ray definition

The purpose of the ray definition module is to arrange each virtual camera based on light-field encoding, so as to define each emitted rendering ray. The subpixel-viewpoint index of each subpixel of the synthetic image is solved first. The index value is the sequence number of the elemental image, which also represents the index of the corresponding virtual camera. Next, the position and direction of the ray are determined according to the index. After that, the rendering ray is projected into the 3D volume space.

As shown in Fig. 4(a), for each subpixel $(i,j,k)$ in the synthetic image, a corresponding ray $ \overrightarrow {R_{ray}}(i,j,k,m)$ is launched from the virtual camera position $P_{v}^{m}$ with the ray direction $ \overrightarrow {D_{dir}}$. $M$ virtual cameras $C_{v}^{1}$, $C_{v}^{2}$, $\cdots$, $C_{v}^{M}$, whose positions are $P_{v}^{1}$, $P_{v}^{2}$, $\cdots$, $P_{v}^{M}$ with the same spacing distance $d_{v}$, are arranged in the same 3D world coordinate system as the RGB camera array. To define the rendering ray $ \overrightarrow {R_{ray}}$, the subpixel-viewpoint index $m(i,j,k)$ of the currently calculated subpixel is first determined based on light-field image coding. $m(i,j,k)$ is denoted by,

$$m\left( {i,j,k} \right) = \left\lfloor {\left( {\frac{{3i + 3j\tan \alpha + k}}{L} - \left\lfloor {\frac{{3i + 3j\tan \alpha + k}}{L}} \right\rfloor } \right) \cdot M} \right\rfloor,$$
where $i \in \left [ {0,{w_{syn}} - 1} \right ]$, $j \in \left [ {0,{h_{syn}} - 1} \right ]$, $k \in \left [ {0,2} \right ]$, $w_{syn}$ and $h_{syn}$ are the width and height of the synthetic image, $L$ is the width of the lenticular lens, $\alpha$ is the slant angle, $M$ is the number of all the off-axis virtual cameras. The position of the $m$-th virtual camera is defined by the following expression,
$$\begin{aligned} \overrightarrow {P_v^m} & = \overrightarrow {P_v^c} - d\left( m \right) \cdot \overrightarrow {{V_0}} \\ & = \overrightarrow {P_v^c} - \left( {m\left( {i,j,k} \right) - \left( {M + 1} \right)/2} \right){d_v} \cdot \overrightarrow {{V_0}} , \end{aligned}$$
where $m \in \left [ {1,M} \right ]$, and $P_{v}^{c}$ is the center location of the virtual camera array defined by the proper viewing distance. The second term of the expression represents the distance offset $d\left ( m \right )$ of the $m$-th virtual camera along the direction $\overrightarrow {{V_0}}$. $ \overrightarrow {P_v^c}$ and $ \overrightarrow {P_v^m}$ are the vectors from the origin point $O$ to the locations $P_{v}^{c}$ and $P_{v}^{m}$, respectively. As shown in Fig. 4(a), the direction $ \overrightarrow {D_{dir}}$ is a superposition of three mutually perpendicular vectors $ \overrightarrow {U}$, $ \overrightarrow {V}$, and $ \overrightarrow {W}$. These vectors are defined following the NVIDIA OptiX pinhole camera model [39] and are elaborated in Section 2.7. The direction of each ray is defined as follows,
$$\overrightarrow {{D_{dir}}} = {\hat U_{unit}}\left( {\left( {1 - \frac{{2\left( {j + 1} \right)}}{{{h_{syn}}}}} \right) \cdot \overrightarrow U + \left( {\frac{{2\left( {i + 1} \right)}}{{{w_{syn}}}} - 1} \right) \cdot \overrightarrow V + \overrightarrow W - d\left( m \right) \cdot \overrightarrow {{V_0}} } \right),$$
where the function $\hat U_{unit}$ returns the unit vector of its input, and $ \overrightarrow {U_{0}}$, $ \overrightarrow {V_{0}}$, $ \overrightarrow {W_{0}}$ are the unit direction vectors of $ \overrightarrow {U}$, $ \overrightarrow {V}$, $ \overrightarrow {W}$, respectively. The coefficients of $ \overrightarrow {U}$ and $ \overrightarrow {V}$ account for the unification of the 3D world coordinate system and the OpenGL rendering coordinate system. The final expression of each rendering ray is given by,
$$\overrightarrow {{R_{ray}}} \left( {i,j,k,m} \right) = \overrightarrow {P_v^m} + {d_0} \cdot \overrightarrow {{W_0}} + \left( {{r_0} + {n_r} \cdot {r_{step}}} \right) \cdot \overrightarrow {{D_{dir}}} ,$$
where the term $d_{0}\cdot \overrightarrow {{W_0}}$ allows the focal plane of the virtual camera array to be manipulated conveniently. The ray $\overrightarrow {{R_{ray}}}$ propagates along the direction $ \overrightarrow {D_{dir}}$ with the initial length $r_{0}$, step size $r_{step}$, and searching step number $n_{r}$.
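A NumPy sketch of Eqs. (1)–(4) is shown below under the stated definitions. The display parameters ($L$, $\tan\alpha$, $M$, $d_v$) and the basis vectors $\overrightarrow U$, $\overrightarrow V$, $\overrightarrow W$, $\overrightarrow{V_0}$, $\overrightarrow{W_0}$ of Section 2.7 are assumed to be supplied in a configuration dictionary; the key names are illustrative.

```python
# Sketch of Eqs. (1)-(4): subpixel-to-viewpoint index and rendering ray definition.
import numpy as np

def viewpoint_index(i, j, k, L, tan_alpha, M):
    """Eq. (1): subpixel-viewpoint index m(i, j, k)."""
    t = (3 * i + 3 * j * tan_alpha + k) / L
    return int(np.floor((t - np.floor(t)) * M))

def unit(v):
    return v / np.linalg.norm(v)

def rendering_ray(i, j, k, cfg):
    """Return the ray origin P_v^m (Eq. (2)) and unit direction D_dir (Eq. (3))."""
    m = viewpoint_index(i, j, k, cfg["L"], cfg["tan_alpha"], cfg["M"])
    d_m = (m - (cfg["M"] + 1) / 2.0) * cfg["d_v"]          # offset d(m) along V0
    origin = cfg["P_vc"] - d_m * cfg["V0"]                 # Eq. (2)
    direction = unit((1 - 2 * (j + 1) / cfg["h_syn"]) * cfg["U"]
                     + (2 * (i + 1) / cfg["w_syn"] - 1) * cfg["V"]
                     + cfg["W"] - d_m * cfg["V0"])          # Eq. (3)
    return origin, direction

def march_points(origin, direction, cfg, n_steps):
    """Eq. (4): sample points along the ray, shifted by d0*W0 to set the focal plane."""
    start = origin + cfg["d0"] * cfg["W0"]
    radii = cfg["r0"] + np.arange(n_steps) * cfg["r_step"]
    return start[None, :] + radii[:, None] * direction[None, :]
```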

2.5 Ray propagation based on the visual hull

The aim of the ray propagation module is to find the visible intersection point between the rendering ray and the 3D model based on the visual hull. The 2D projected coordinates of the intersection in each viewpoint are determined using perspective projection and are then used in the subsequent texture mapping module. After the ray definition, the rendering ray is launched from its origin into the 3D volume space. At each step along the ray, it is judged whether the frontmost point intersects the model surface. When all the 2D projection points fall inside the corresponding silhouettes for the first time, the 3D intersection point is obtained.

As shown in Fig. 4(b), there are $N$ free-posed RGB cameras $C_{r}^{1}$, $C_{r}^{2}$, $\cdots$, $C_{r}^{N}$, whose positions are $P_{r}^{1}$, $P_{r}^{2}$, $\cdots$, $P_{r}^{N}$. Synchronized frames $I_{1}$, $I_{2}$, $\cdots$, $I_{N}$ are captured by the $N$ free-viewpoint cameras. The silhouettes of the target model in these viewpoints are $S_{1}$, $S_{2}$, $\cdots$, $S_{N}$, which are segmented during the data capturing process. The frontmost point of the ray, defined as $P_{w}$ in the 3D world coordinate system, marches forward and is tested at each step to determine whether it belongs to the human surface. $P_{w}$ is denoted as the intersection $P_{0}$ once it is confirmed that the ray intersects the model surface. To test the frontmost point, $P_{w}$ is projected into each captured viewpoint. Using the camera parameters calibrated in Section 2.1, the 3D point perspective projection is given by,

$$P_{uv}^n = {K_n} \cdot ({R_n} \cdot {P_w} + {T_n}) / {Z_n},$$
where $K_{n}$ is the calibrated intrinsic matrix, $R_{n}$ and $T_{n}$ are the extrinsic parameters of the $n$-th camera, and $Z_{n}$ is the depth of $P_{w}$ in the $n$-th camera coordinate system. All the 2D projection points $P_{uv}^{1}(u_{1},v_{1},1)$, $P_{uv}^{2}(u_{2},v_{2},1)$, $\cdots$, $P_{uv}^{N}(u_{N},v_{N},1)$ are computed in this way. Afterward, the function $f_{n}(P_{uv}^{n})$ is defined by,
$${f_n}\left( {P_{uv}^n} \right) = \left\{ \begin{array}{l} 1,P_{uv}^n \in {S_n}\\ 0,otherwise \end{array} \right.,u \in \left[ {1,{w_r}} \right],v \in \left[{1,{h_r}}\right].$$
When the $n$-th projection point $P_{uv}^n \in {S_n}$, the function ${f_n}\left ( {P_{uv}^n} \right )$ equals 1. If the frontmost point $P_{w}$ is part of the model, every 2D projected point must lie inside the silhouette of its captured viewpoint. Hence, the function $g(P_{w})$ is given by,
$$g\left( {{P_w}} \right) = \prod\nolimits_{n = 1}^N {{f_n}\left( {P_{uv}^n} \right)}.$$
which is the product of all the projected results ${f_n}\left ( {P_{uv}^n} \right )$. When the ray $ \overrightarrow {R_{ray}}$ reaches the surface of the model, the function $g(P_{w})$ equals 1 for the first time, and the intersection $P_{w}$ is denoted as $P_{0}$. Conversely, if a ray $ \overrightarrow {R_{ray}}$ never hits the visual hull, which means $g(P_{w})$ equals 0 at every marching step, the color of the subpixel is recorded as the background color.
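The silhouette test of Eqs. (5)–(7) can be sketched as follows; the calibration data and binary silhouette masks are assumed inputs, and the nearest-pixel lookup is a simplification of the actual mask sampling.

```python
# Sketch of Eqs. (5)-(7): project P_w into every view and test all silhouettes.
import numpy as np

def project_point(P_w, K, R, T):
    """Eq. (5): pinhole projection of a 3D world point into one calibrated view."""
    p_cam = R @ P_w + T
    uv = (K @ p_cam) / p_cam[2]       # divide by the camera-space depth Z_n
    return uv[:2]

def inside_all_silhouettes(P_w, cams, masks):
    """Eq. (7): g(P_w) = 1 only if every projection lies inside its silhouette."""
    for (K, R, T), mask in zip(cams, masks):
        u, v = project_point(P_w, K, R, T)
        h, w = mask.shape
        ui, vi = int(round(u)), int(round(v))
        if not (0 <= ui < w and 0 <= vi < h) or mask[vi, ui] == 0:
            return False              # f_n = 0 for this view, so the product is 0
    return True

def march_to_surface(samples, cams, masks):
    """Return the first sample along the ray that hits the visual hull (P_0),
    or None if the ray never intersects the model (background subpixel)."""
    for P_w in samples:
        if inside_all_silhouettes(P_w, cams, masks):
            return P_w
    return None
```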

2.6 Color generation based on free-viewpoint texture mapping

The objective of color generation is to calculate the texture of the observed intersection point in real time. The challenges are that the real camera array is randomly distributed and that the irregular target model, which has a certain thickness, is not necessarily at the center of the 3D space. To handle these situations, a dynamic free-viewpoint texture mapping method is proposed for the free-viewpoint cameras. As shown in Fig. 4(c), instead of using all the RGB cameras, a smaller camera group $C_r^{{n_s} - a}$, $\cdots$, $C_r^{{n_s}}$, $\cdots$, $C_r^{{n_s} + a}$ is chosen to calculate the final texture.

When the intersection point is observed by the real cameras, it is sometimes occluded by the model itself, because the real camera array is randomly distributed and the model has its own thickness. As shown in Fig. 5, occlusion occurs naturally when the cameras observe the target point $P_{0}$. In this case, the color observed by such a real camera cannot represent the color of the intersection, so the viewpoints with occlusion need to be eliminated. Taking the situation in Fig. 5 as an example, the locations of the RGB cameras are defined as $P_r^{1}$, $P_r^{2}$, $\cdots$, $P_r^{N}$. The actual observation points of cameras $C_r^{1}$, $C_r^{2}$, $\cdots$, $C_r^{n}$ are $P_0^{1}$, $P_0^{2}$, $\cdots$, $P_0^{n}$ instead of $P_{0}$, which means these captured viewpoints cannot be introduced into the texture mapping. Moreover, the depth information of each real camera, i.e., the lengths of the vectors $\overrightarrow {{P_0} - P_r^1}$, $\overrightarrow {{P_0} - P_r^2}$, $\cdots$, $\overrightarrow {{P_0} - P_r^N}$, cannot be utilized either, because the locations of the target model and the real cameras are arbitrary. If it were assumed that closer cameras should contribute greater weight to the texture, then according to the geometric distribution in Fig. 5, the cameras contributing the most would be $C_r^{1}$, $C_r^{2}$, $\cdots$, $C_r^{n}$, whose captured views are obviously occluded.


Fig. 5. The proposed free-viewpoint texture mapping method. The black lines represent the occluded ray and the orange lines are the rays without occlusion. The orange rectangle means the selected camera group for texture blending in our method.


To handle the impact of occlusion and the free-viewpoint cameras, the real and virtual cameras are first unified into the same coordinate system. After the 3D positions of all real cameras are solved from the camera extrinsic parameters, the correct texture can be determined from the geometric distribution of all the cameras. The vectors $\overrightarrow {{P_0} - P_r^1}$, $\overrightarrow {{P_0} - P_r^2}$, $\cdots$, $\overrightarrow {{P_0} - P_r^N}$, which denote the viewing directions of the RGB cameras toward $P_0$, are calculated. The angles between $ \overrightarrow {D_{dir}}$ and $\overrightarrow {{P_0} - P_r^1}$, $\cdots$, $\overrightarrow {{P_0} - P_r^N}$ are calculated, and the minimum angle is found. The vector $ \overrightarrow {D_{dir}}$ represents the viewing direction of the intersection $P_0$ as seen by the $m$-th virtual camera $C_v^m$. The real camera whose viewing direction has the minimum angle with $ \overrightarrow {D_{dir}}$ is defined as $C_r^{{n_s}}$. In this way, a camera group that is not affected by occlusion or the camera distribution can be selected. The camera index $n_{s}$ is defined by the following expression,

$${n_s} = \mathop {\arg \max }_n \left( {\overrightarrow {{D_{dir}}} \cdot {{\hat U}_{unit}}\left( {\overrightarrow {{P_0} - P_r^n} } \right)} \right),n \in \left[ {1,N} \right].$$

The RGB camera positions $P_r^{1}$, $P_r^{2}$, $\cdots$, $P_r^{N}$ are given as follows,

$$P_r^n = - R_n^{ - 1}{T_n},n \in \left[ {1,N} \right].$$

After the camera group $C_r^{{n_s} - a}$, $\cdots$, $C_r^{{n_s}}$, $\cdots$, $C_r^{{n_s} + a}$ is selected, the color of the subpixel is defined to be the color of the 3D intersection point $P_0$. Since the 2D coordinates of the projected points $P_{uv}^{1}$, $P_{uv}^{2}$, $\cdots$, $P_{uv}^{N}$ of the intersection $P_0$ have already been calculated in the perspective projection procedure of formula (5), the pixel colors observed by the camera group $C_r^{{n_s} - a}$, $\cdots$, $C_r^{{n_s}}$, $\cdots$, $C_r^{{n_s} + a}$ are easily fetched as $Cl_{-a}$, $\cdots$, $Cl_0$, $\cdots$, $Cl_a$. The color of the subpixel is defined by,

$$Color\left( {i,j,k} \right) = \sum_{q ={-} a}^a {{A_q} \cdot C{l_q}},$$
where $A_{q}$ is the confidence weight. To clarify the relationship among the stated variables: the final color $Color\left ( {i,j,k} \right )$ of the subpixel and of the intersection $P_0$ is calculated from the camera group selected by $n_s$, and the location of $P_0$ is determined by the virtual camera with index $m\left ( {i,j,k} \right )$. The final color of the subpixel is blended from the colors of the selected viewpoints. Finally, three subpixels are computed, one per color channel, for each pixel of the synthetic image.
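A sketch of Eqs. (8)–(10) is given below: the real camera whose viewing direction best matches the rendering ray is selected, its neighbours within $\pm a$ are added, and their colors are blended with confidence weights. The weights (0.2, 0.6, 0.2) follow the configuration reported in Section 3.1; the simple index clamping at the ends of the array is an assumption for the example.

```python
# Sketch of Eqs. (8)-(10): camera group selection and weighted color blending.
import numpy as np

def camera_centers(extrinsics):
    """Eq. (9): P_r^n = -R_n^{-1} T_n for every RGB camera, given (R_n, T_n) pairs."""
    return [(-np.linalg.inv(R) @ T).ravel() for R, T in extrinsics]

def select_reference_view(P0, D_dir, centers):
    """Eq. (8): index n_s of the camera whose viewing direction of P_0 has the
    smallest angle with the rendering ray direction D_dir."""
    dots = [np.dot(D_dir, (P0 - c) / np.linalg.norm(P0 - c)) for c in centers]
    return int(np.argmax(dots))

def blend_color(P0, D_dir, centers, projections, images, a=1,
                weights=(0.2, 0.6, 0.2)):
    """Eq. (10): weighted blend of the colors fetched from the selected camera group.
    'projections' are the 2D points of P_0 from the Eq. (5) projection step."""
    n_s = select_reference_view(P0, D_dir, centers)
    color = np.zeros(3)
    for q, w_q in zip(range(-a, a + 1), weights):
        n = int(np.clip(n_s + q, 0, len(images) - 1))  # clamp at the array ends
        u, v = projections[n]
        color += w_q * images[n][int(round(v)), int(round(u))]
    return color
```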

2.7 Implementation

The vectors of the rendering ray are defined as follows,

$$\overrightarrow W = t - P_v^c,$$
$$\overrightarrow U = \left| {\overrightarrow W } \right| \cdot aspect \cdot {\hat U_{unit}}\left( {\overrightarrow W \otimes \overrightarrow {{V_{up}}} } \right),$$
$$\overrightarrow V = \left| {\overrightarrow W } \right| \cdot \tan \left( {\frac{{vfov}}{2}} \right) \cdot {\hat U_{unit}}\left( {\overrightarrow U \otimes \overrightarrow W } \right),$$
where the viewing target point $t\left ( {{x_t},{y_t},{z_t}} \right )$ determines the direction of the vector $\overrightarrow W$. The vector $\overrightarrow W$ points from $P_{v}^{c}$ to $t$ through the voxel volume, and its length is the perpendicular distance from $t$ to the virtual camera array. The vectors $\overrightarrow U$ and $\overrightarrow V$ determine the roll angle of the synthetic image via the up vector $\overrightarrow {{V_{up}}}$, where the vertical field of view $vfov$ and the aspect ratio $aspect$ are parameters of the virtual cameras. The direction of the vector $\overrightarrow V$ is parallel to the virtual camera array.
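These relations can be written directly as a small helper, assuming the inputs ($P_v^c$, $t$, $\overrightarrow{V_{up}}$, $vfov$, $aspect$) are given; the code follows Eqs. (11)–(13) as stated above rather than any particular library's camera setup.

```python
# Sketch of Eqs. (11)-(13): basis vectors of the rendering rays.
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def camera_basis(P_vc, target, V_up, vfov_deg, aspect):
    """Compute U, V, W from the array center, target point, and up vector."""
    W = target - P_vc                                                 # Eq. (11)
    U = np.linalg.norm(W) * aspect * unit(np.cross(W, V_up))          # Eq. (12)
    V = (np.linalg.norm(W) * np.tan(np.radians(vfov_deg) / 2.0)
         * unit(np.cross(U, W)))                                      # Eq. (13)
    return U, V, W
```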

In the experiment, the resolution of the 3D volume is $512^{3}$. The number of free-viewpoint cameras $N$ is 40. The number of virtual cameras $M$ is set to $50\sim 80$ for various 3D light-field displays. The resolution of each captured frame is 1152$\times$648, and the frame rate is 25 fps. The resolution of the synthetic image is 3840$\times$2160. The experiments are performed on an Intel Core i9-10900K CPU @ 3.7 GHz and an NVIDIA RTX 3090 GPU.

2.8 Rendering pipeline of LVH

The rendering pipeline of our method is shown in Fig. 6. Firstly, the parameters of the 3D light-field display are determined during the parameter initialization process. The preprocessed frames of the multi-view cameras are loaded into device memory.


Fig. 6. Flowchart of our method.


Secondly, the 3D voxel object and the 3D texture object are created for the rendering process. To interactively adjust the imaging of the target model, the user interactions are initialized on the CPU. Therefore, the position, direction, and zero-depth plane of the virtual camera array can be easily adjusted during the ray creation and update process, which means the best viewing distance can be chosen to achieve a better viewing effect. The resolution of each virtual camera is set to 4K, corresponding to the synthetic image size. Thirdly, the rendering rays for ray propagation and color generation, which are defined according to the subpixel viewpoint arrangement, are created using the ray-casting technique. A two-dimensional kernel function is adopted to calculate the subpixels in parallel, and each rendering ray is created and computed in its own GPU thread. During the 3D imaging process, the data are processed frame by frame. Specifically, the 3D intersection with the target model is located via ray propagation, and the color of every subpixel is computed by perspective projection, texture mapping, and color blending. The color of one pixel is obtained after three ray castings. The GPU threads are synchronized after all pixels of the synthetic image are computed. Finally, a pixel buffer object is used to update and display the newest 3D image directly on the GPU. Our proposed method operates at a high frame rate so that a continuous viewing effect can be achieved.
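To make the data flow explicit, the sketches above can be composed into a serial reference of the per-subpixel logic that the 2D CUDA kernel parallelizes. This plain Python loop is far too slow for real time and is only a readability aid; it assumes the helper functions from the earlier sketches are in scope and that `cams` is a list of $(K_n, R_n, T_n)$ tuples.

```python
# Serial reference of the per-frame LVH pipeline (one independent ray per subpixel).
import numpy as np

def render_synthetic_image(cfg, cams, masks, images, centers, n_steps=512,
                           background=(0.0, 0.0, 0.0)):
    syn = np.zeros((cfg["h_syn"], cfg["w_syn"], 3))
    for j in range(cfg["h_syn"]):
        for i in range(cfg["w_syn"]):
            for k in range(3):                          # one rendering ray per subpixel
                origin, D_dir = rendering_ray(i, j, k, cfg)
                samples = march_points(origin, D_dir, cfg, n_steps)
                P0 = march_to_surface(samples, cams, masks)
                if P0 is None:
                    syn[j, i, k] = background[k]        # ray missed the visual hull
                    continue
                projections = [project_point(P0, K, R, T) for K, R, T in cams]
                syn[j, i, k] = blend_color(P0, D_dir, centers,
                                           projections, images)[k]
    return syn
```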

3. Experimental results

3.1 Texture evaluations

To evaluate the accuracy of the computed texture of our method, the virtual cameras are placed at the same positions as the RGB cameras with the same viewing directions. In this way, the images rendered by the virtual cameras can reflect the accuracy of the texture compared with the target texture. Rendered viewpoints of view $\#20$ are shown in Fig. 7.


Fig. 7. Rendered viewpoints of different persons in view $\#20$. The left two columns are target textures. The middle two columns and the right two columns are our method and GIBVH, respectively. Details are shown in the rectangle areas.


The experimental results prove that LVH can reconstruct various models in real time with forty free-viewpoint cameras. Besides, our method can select the correct camera group for texture mapping, whereas the depth-based texture calculation method is found to perform correctly only with uniformly distributed cameras. We conducted the above experiments at all RGB camera positions, and more results from view $\#16$ to view $\#25$ are given in Fig. 8. In addition, middle views between every two adjacent cameras are presented. To compare our method with the previous method quantitatively, the PSNR of the rendering result in each specific viewpoint is calculated.


Fig. 8. Experimental results of target textures and the rendered views of our method. From top to bottom, figures (a), (b), and (c) are PSNR results on Person 1, Person 2, and Person 3, respectively. The first and third rows are the input textures, and the second and fourth rows are rendered views. Each synthesized viewpoint is in the middle of two posed views.


In fact, the color blending process is still essential for the color continuity of the intermediate virtual views and a better viewing effect, even though it results in a drop of PSNR. More specifically, the captured colors of a 3D point at different viewing positions are naturally not strictly the same owing to the material of the object. Besides, complex illumination introduces reflections, refractions, and shadows on the observed object, which change the color tone of the same 3D point. In addition, it was demonstrated in the experiment that the captured colors of a 3D point cannot be made identical through color calibration. As a consequence, color blending is adopted in LVH. As shown in formula (10), the confidence weight $A_{q}$ controls the level of color consistency of the virtual viewpoints. If the weight $A_{0}$ of the reference view rises, the PSNR in each viewpoint increases, but the color continuity of the virtual views declines. Three captured RGB viewpoints are selected for blending in the experiment, depending on the illumination and the free-viewpoint camera array: the adjacent view number $a$ for color blending is 1, $A_{0}$ is set to 0.6, and $A_{-1}$ and $A_{1}$ are set to 0.2. In this way, all the rendered virtual viewpoints achieve better color consistency.

When characters with various poses are considered, the PSNR distribution over the views for the two methods is presented in Fig. 9. Due to the perspective projection during ray propagation, the reconstructed model covers the region commonly observed by all cameras. The PSNR therefore only counts the pixels located inside the area projected from this commonly observed region in each viewpoint. The experiments prove that the proposed texture mapping method performs better in every viewpoint. Various factors may affect the PSNR of a specific viewpoint, such as illumination, motion blur, character postures, or clothing materials, but our method outperforms the previous method under all conditions. Without being limited by the camera array arrangement or the character location and poses, our method improves the PSNR by a minimum of 4.25 dB, a maximum of 21.46 dB, and an average of 11.88 dB in the above experiments. The results prove that our method can deal with different viewing conditions by introducing the 3D point reconstruction and placing the virtual camera array within the viewing area of the RGB camera array.
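For reference, a masked PSNR of this kind could be computed as sketched below, assuming float images in $[0,1]$ and a binary mask of the commonly observed region; the mask construction itself (projecting the reconstructed volume into each view) is not shown.

```python
# Sketch of a PSNR restricted to the commonly observed region.
import numpy as np

def masked_psnr(rendered, target, mask):
    """rendered, target: HxWx3 float images in [0, 1]; mask: HxW binary region."""
    m = mask.astype(bool)
    mse = np.mean((rendered[m] - target[m]) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)
```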


Fig. 9. PSNR comparisons of rendered views by our method and GIBVH. From left to right, experiments are on Person 1, Person 2 and Person 3, respectively.


In the middle of the line charts, from view $\#15$ to view $\#30$, the PSNR of both methods decreases to a certain extent. It is worth noting that the color correction losses are higher for these viewpoints, corresponding to the color calibration result shown in Fig. 3(b). Adjacent viewpoints with large color differences produce color changes across the whole image after weighted blending. Therefore, the PSNR of these viewpoints is lower, but this is beneficial for maintaining the color consistency of the dense viewpoints on the 3D light-field display.

Compared with our method, the PSNR of the blue curve keeps decreasing from view $\#15$, and the error from view $\#30$ to view $\#40$ is more pronounced. This is because the RGB camera array is arbitrarily distributed and the position of the character moving in the scene is not fixed. Therefore, during the rendering process, the camera that is closest to the model is not necessarily the correct source of the calculated texture. This proves that the free-viewpoint texture mapping method can correctly employ the free-posed input viewpoints in real time without such limitations.

3.2 Experimental configurations and results

To adapt to a variety of 3D light-field display devices, the RGB camera array can capture about 120 degrees of human motion video. The number of virtual cameras is adjusted according to the particular 3D light-field display device used to present the human figure. In our experiment, a 3D light-field display with a $60^{\circ }$ reliable viewing angle and 4K resolution is used, whose line number is 26.44 and inclination angle is 0.17, assembled with a 27-inch LCD panel. Sixty dense virtual views are rendered in real time to present a $60^{\circ }$ viewing angle. Figure 10 presents the captured 3D synthetic images during 3D imaging as a demonstration of our proposed method.


Fig. 10. The 3D light-field display used in our experiments and the display results from the left $30^{\circ }$, the center, and the right $30^{\circ }$ viewing positions.


4. Conclusion

In summary, a real-time Light-field Visual Hull (LVH) method is presented, which realizes light-field data generation of dynamic real-world scenes through end-to-end subpixel calculation with 3D model reconstruction. The overall mathematical formulation of LVH is demonstrated. The rendering ray corresponding to each subpixel is defined and launched, and the 3D points of the model are located during ray propagation. Next, the free-viewpoint texture mapping method is presented to resolve the precise texture based on the free-viewpoint RGB cameras. Besides, to improve efficiency, only the essential 3D points and textures are calculated following the concept of backward ray tracing. Finally, all the cameras are well calibrated and undistorted with the proposed multi-view camera calibration method. Experimental results show the validity of our method. 3D synthetic images can be generated to provide a continuous and smooth viewing effect with a large viewing angle at 4K resolution at over 25 fps. The PSNR of the rendered views is improved significantly. Our method can handle tests with different models thanks to the reliable ray propagation process, and it can potentially be applied to the study of dynamic scenes with 3D structures for 3D light-field displays.

Funding

National Key Research and Development Program of China (2021YFB2802203); National Natural Science Foundation of China (62075016, 62175017).

Disclosures

The authors declare no conflicts of interest. This work is original and has not been published elsewhere.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. X. Sang, X. Gao, X. Yu, S. Xing, Y. Li, and Y. Wu, “Interactive floating full-parallax digital three-dimensional light-field display based on wavefront recomposing,” Opt. Express 26(7), 8883–8889 (2018). [CrossRef]  

2. X. Yu, X. Sang, S. Xing, T. Zhao, D. Chen, Y. Cai, B. Yan, K. Wang, J. Yuan, C. Yu, and W. Dou, “Natural three-dimensional display with smooth motion parallax using active partially pixelated masks,” Opt. Commun. 313, 146–151 (2014). [CrossRef]  

3. X. Yu, X. Sang, X. Gao, D. Chen, B. Liu, L. Liu, C. Gao, and P. Wang, “Dynamic three-dimensional light-field display with large viewing angle based on compound lenticular lens array and multi-projectors,” Opt. Express 27(11), 16024–16031 (2019). [CrossRef]  

4. X. Yu, X. Sang, X. Gao, B. Yan, D. Chen, B. Liu, L. Liu, C. Gao, and P. Wang, “360-degree tabletop 3d light-field display with ring-shaped viewing range based on aspheric conical lens array,” Opt. Express 27(19), 26738–26748 (2019). [CrossRef]  

5. M. Halle, “Multiple viewpoint rendering,” in Proceedings of the 25th annual conference on Computer graphics and interactive techniques, (1998), pp. 243–254.

6. S. Xing, X. Sang, X. Yu, C. Duo, B. Pang, X. Gao, S. Yang, Y. Guan, B. Yan, J. Yuan, and K. Wang, “High-efficient computer-generated integral imaging based on the backward ray-tracing technique and optical reconstruction,” Opt. Express 25(1), 330–338 (2017). [CrossRef]  

7. B.-N.-R. Lee, Y. Cho, K. S. Park, S.-W. Min, J.-S. Lim, M. C. Whang, and K. R. Park, “Design and implementation of a fast integral image rendering method,” in International Conference on Entertainment Computing, (Springer, 2006), pp. 135–140.

8. H. Liao, K. Nomura, and T. Dohi, “Autostereoscopic integral photography imaging using pixel distribution of computer graphics generated image,” in ACM SIGGRAPH 2005 Posters, (2005), pp. 73–es.

9. Y. Guan, X. Sang, S. Xing, Y. Li, and B. Yan, “Real-time rendering method of depth-image-based multiple reference views for integral imaging display,” IEEE Access 7, 170545–170552 (2019). [CrossRef]  

10. Y. Guan, X. Sang, S. Xing, Y. Chen, Y. Li, D. Chen, X. Yu, and B. Yan, “Parallel multi-view polygon rasterization for 3d light field display,” Opt. Express 28(23), 34406–34421 (2020). [CrossRef]  

11. S. Chan, H.-Y. Shum, and K.-T. Ng, “Image-based rendering and synthesis,” IEEE Signal Process. Mag. 24(6), 22–33 (2007). [CrossRef]  

12. H.-C. Shin, Y.-J. Kim, H. Park, and J.-I. Park, “Fast view synthesis using gpu for 3d display,” IEEE Trans. Consumer Electron. 54(4), 2068–2076 (2008). [CrossRef]  

13. T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo magnification: Learning view synthesis using multiplane images,” arXiv, arXiv:1805.09817 (2018). [CrossRef]  

14. P. P. Srinivasan, R. Tucker, J. T. Barron, R. Ramamoorthi, R. Ng, and N. Snavely, “Pushing the boundaries of view extrapolation with multiplane images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 175–184.

15. J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. Overbeck, N. Snavely, and R. Tucker, “Deepview: View synthesis with learned gradient descent,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 2367–2376.

16. D. Chen, X. Sang, W. Peng, X. Yu, and H. C. Wang, “Multi-parallax views synthesis for three-dimensional light-field display using unsupervised cnn,” Opt. Express 26(21), 27585–27598 (2018). [CrossRef]  

17. D. Chen, X. Sang, P. Wang, X. Yu, B. Yan, H. Wang, M. Ning, S. Qi, and X. Ye, “Dense-view synthesis for three-dimensional light-field display based on unsupervised learning,” Opt. Express 27(17), 24624–24641 (2019). [CrossRef]  

18. D. Chen, X. Sang, P. Wang, X. Yu, X. Gao, B. Yan, H. Wang, S. Qi, and X. Ye, “Virtual view synthesis for 3d light-field display based on scene tower blending,” Opt. Express 29(5), 7866–7884 (2021). [CrossRef]  

19. X. Wang, Y. Zan, S. You, Y. Deng, and L. Li, “Fast and accurate light field view synthesis by optimizing input view selection,” Micromachines 12(5), 557 (2021). [CrossRef]  

20. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European conference on computer vision, (Springer, 2020), pp. 405–421.

21. A. Yu, V. Ye, M. Tancik, and A. Kanazawa, “pixelnerf: Neural radiance fields from one or few images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021), pp. 4578–4587.

22. K. Zhang, G. Riegler, N. Snavely, and V. Koltun, “Nerf++: Analyzing and improving neural radiance fields,” arXiv, arXiv:2010.07492 (2020). [CrossRef]  

23. K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla, “Nerfies: Deformable neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 5865–5874.

24. K. Regmi and A. Borji, “Cross-view image synthesis using geometry-guided conditional gans,” Computer Vision and Image Understanding 187, 102788 (2019). [CrossRef]  

25. B. G. Baumgart, “Geometric modeling for computer vision,” Tech. rep. (Stanford Univ., 1974).

26. A. Laurentini, “The visual hull concept for silhouette-based image understanding,” IEEE Trans. Pattern Anal. Machine Intell. 16(2), 150–162 (1994). [CrossRef]  

27. A. Ladikos, S. Benhimane, and N. Navab, “Efficient visual hull computation for real-time 3d reconstruction using cuda,” in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, (IEEE, 2008), pp. 1–8.

28. S. Hauswiesner, R. Khlebnikov, M. Steinberger, M. Straka, and G. Reitmayr, “Multi-gpu image-based visual hull rendering,” in EGPGV@ Eurographics, (2012), pp. 119–128.

29. S. Abdelhak and B. M. Chaouki, “High performance volumetric modelling from silhouette: Gpu-image-based visual hull,” in 2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA), (IEEE, 2016), pp. 1–7.

30. R. Shao, H. Zhang, H. Zhang, M. Chen, Y.-P. Cao, T. Yu, and Y. Liu, “Doublefield: Bridging the neural surface and radiance fields for high-fidelity human reconstruction and rendering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 15872–15882.

31. Y. Zheng, R. Shao, Y. Zhang, T. Yu, Z. Zheng, Q. Dai, and Y. Liu, “Deepmulticap: Performance capture of multiple characters using sparse multiview cameras,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 6239–6249.

32. Z. Zhang, “Camera calibration with one-dimensional objects,” IEEE Trans. Pattern Anal. Machine Intell. 26(7), 892–899 (2004). [CrossRef]  

33. OBS, “Open broadcaster software,” https://obsproject.com/.

34. S. Lin, L. Yang, I. Saleemi, and S. Sengupta, “Robust high-resolution video matting with temporal guidance,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (2022), pp. 238–247.

35. M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM 24(6), 381–395 (1981). [CrossRef]  

36. B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment–a modern synthesis,” in International workshop on vision algorithms, (Springer, 1999), pp. 298–372.

37. K. Li, Q. Dai, and W. Xu, “High quality color calibration for multi-camera systems with an omnidirectional color checker,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, (IEEE, 2010), pp. 1026–1029.

38. Opencv, “Macbeth chart module,” https://docs.opencv.org/4.5.0/dd/d19/group__mcc.html.

39. OptiX, “Nvidia optix ray tracing engine,” https://docs.nvidia.com/vpi/appendix_pinhole_camera.html.
