
Virtual view synthesis for 3D light-field display based on scene tower blending

Open Access

Abstract

Three-dimensional (3D) light-field displays have achieved great improvements. However, the collection of dense viewpoints of a real 3D scene is still a bottleneck. Virtual views can be generated by unsupervised networks, but the quality of different views is inconsistent because the networks are trained separately on each posed view. Here, a virtual view synthesis method for the 3D light-field display based on scene tower blending is presented, which can synthesize high-quality virtual views with correct occlusions by blending all tower results, so that dense viewpoints with smooth motion parallax can be provided on the 3D light-field display. Posed views are combinatorially input into diverse unsupervised CNNs to predict respective input-view towers, and towers of the same viewpoint are fused together. All posed-view towers are blended into a scene color tower and a scene selection tower, so that 3D scene distributions at different depth planes can be accurately estimated. The blended scene towers are soft-projected to synthesize virtual views with correct occlusions. A denoising network is used to improve the image quality of the final synthetic views. Experimental results demonstrate the validity of the proposed method, which shows outstanding performance under various disparities. The PSNR of the virtual views is about 30 dB and the SSIM is above 0.91. We believe that our view synthesis method will be helpful for future applications of the 3D light-field display.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Recently, the three-dimensional (3D) light-field display has advanced greatly and is considered one of the most promising 3D display methods. By effectively modulating light directions, the 3D light-field display is able to present 3D scenes with large viewing angles and dense viewpoints, so that more 3D information can be perceived [1,2]. However, the collection of dense views of a real 3D scene is still a bottleneck for wide applications of the 3D light-field display.

Many approaches have been proposed to obtain multiple views. Light-field cameras and dense camera arrays can be used to capture multiple views directly, such as Lytro cameras and the Stanford large camera arrays [3,4]. However, neither of them is suitable for 3D light-field collection, because of physical limitations such as short baselines between views and complex manufacturing procedures. View synthesis algorithms operating on sparse views can be used to generate multiple virtual views. Image-based rendering (IBR) was proposed to synthesize a novel view by warping nearby existing views without 3D reconstruction [5]. Within a narrow viewing angle, this method can represent 3D scenes with realistic virtual views. However, due to the discontinuity of the scene depth, there are severe distortions and blurs in the synthesized images when the disparity becomes large.

Deep learning methods have been used for the IBR task by training a convolutional neural network (CNN) end-to-end to synthesize novel views. Since CNNs are very good at feature extraction and reconstruction, these methods are able to generate synthetic views with high quality. One of them interpolates dense novel views of the light field by angular super-resolution, such as EPI reconstruction [6]. However, the maximum disparity is too narrow and the scene geometry is not considered. Another approach estimates a correspondence map between posed views and blends the warped images into a novel view, such as DVM [7] and Appearance Flow [8], but the synthesis position is not flexible. Another approach optimizes an underlying continuous volumetric scene function from a sparse set of inputs to synthesize novel views, such as NeRF [9], but the network has to be retrained for each scene. Other methods compute layered scene distributions along depth by plane sweeping, and project them to synthesize the novel view in various ways. For example, DeepStereo normalizes the scene selection tower by softmax and outputs the result by layer-weighted summation [10,11]. MPI outputs the result by alpha-transparency map accumulation [12–14]. And LLFF blends multiple MPIs to decrease occlusion errors [15]. However, these methods need plenty of target views for supervised training, which are always difficult to acquire. The MPVN [16] and Dense-View Synthesis [17] methods, proposed in our previous works, are able to generate multiple views by unsupervised learning, but separate training on each posed view results in inconsistent quality across varying viewpoints.

Here, a method of virtual view synthesis for the 3D light-field display based on scene tower blending is proposed. It can synthesize high-quality virtual views with correct occlusions by blending 3D view towers, and provide dense viewpoints on the 3D light-field display with smooth motion parallax. The approach consists of four main stages. Unsupervised learning CNNs with combinatorial input views are used to predict respective input-view towers, and towers of the same viewpoint are fused together. All posed-view towers are blended into a scene color tower and a scene selection tower. The blended scene towers are soft-projected to synthesize novel views with correct occlusions. A denoising network is used to improve the image quality of the final synthesized views. Since 3D scene distributions at different depth planes can be accurately estimated, this method can generate high-quality virtual views. Experimental results demonstrate the validity of the proposed method. The PSNR of the virtual views is about 30 dB and the SSIM is above 0.91 under various disparities.

The schematic comparison is shown in Fig. 1. In previous view synthesis methods, all posed views are reprojected to the virtual view at different depth planes and input into a CNN to compute the view tower, i.e., the layered distributions of the target view. However, background estimations of the target view are always disturbed by foregrounds from other viewpoints, which causes severe occlusion errors. LLFF solves these problems by blending multiple MPIs, but it requires extra viewpoints in its deep learning pipeline. In the proposed method, posed views are combinatorially input into diverse types of networks. All related posed-view towers are fused and blended together into a complete blended scene tower to decrease estimation errors.

Fig. 1. The schematic comparison between different methods. For example, there are 3 posed views used for view synthesis. (a) is the previous CNN method. (b) is the LLFF method. (c) is the proposed fusion and blending method.

2. Proposed method

2.1 Problem of previous CNN methods

For previous all-input-view CNN methods, there are always obvious artifacts on synthesized views, caused by 3D scene occlusion errors [16,17]. In this process, all posed views and their associated features are projected and input into the color network to compute layered distributions at different depths. Since images are convolved by multiple kernels and summed as new features before ReLU (non-negativity) in the CNN, the output color result is always very similar to the refocused image at that depth, where the refocused regions can be identified by the selection network. As shown in Fig. 2 (a), the previous methods are able to correctly reconstruct the foreground region, but the reconstructed background is incomplete due to front-scene occlusions, which degrades the quality of virtual viewpoints. In order to solve this problem, as shown in Fig. 2 (b), posed views are combinatorially input into the networks to obtain respective distributions in the proposed method. Related results are fused and blended into a complete and correct scene plane. Thus, the blended scene distributions can be used to synthesize dense views with correct occlusions and consistent quality.

Fig. 2. The schematic comparison between previous methods and the proposed method. (a) There are occlusion errors in background estimation for all-input-view CNN methods. (b) The blended scene distribution is complete and correct for the proposed method.

In addition, when the number of input views is increased, the accuracy of the foreground can be improved, but the occluded parts of the background are also expanded. An example of the Middlebury views (Drumsticks) is shown in Fig. 3, where the obtained foreground and background planes under different numbers of input views are presented. We can see that the foreground result of 5 input views is more accurate, but the missing regions of the background under 5 input views are much larger than those under 3 and 2 input views. After softmax normalization, focused regions are identified by a peak weight value along the depth dimension, but the occluded backgrounds tend to be distributed without a concentrated peak, which causes ghost artifacts and inconsistent quality when novel views are predicted. Therefore, multiple networks with diverse input numbers are applied to estimate the scene distribution for an accurate foreground and a complete background.
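To illustrate the peak behavior described above, the following minimal numpy sketch (an illustration, not the authors' code) applies softmax along the depth dimension of a selection tower: a focused pixel produces one sharp peak, while an occluded background pixel produces a flat, scattered distribution with no dominant depth plane.

```python
import numpy as np

def softmax_along_depth(logits):
    """Normalize per-pixel selection scores over the depth-plane axis (axis 0)."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

focused  = np.array([0.1, 0.2, 5.0, 0.3, 0.1])   # one strong response at plane 2
occluded = np.array([0.8, 0.9, 1.0, 0.9, 0.8])   # no concentrated peak

print(softmax_along_depth(focused))   # ~[0.007, 0.008, 0.97, 0.009, 0.007]
print(softmax_along_depth(occluded))  # ~[0.18, 0.20, 0.23, 0.20, 0.18]
```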

Fig. 3. Foreground and background results of the all-input-view network under different numbers of input views. When the input number is increased, the accuracy of the foreground can be improved, but the occluded parts of the background are also expanded.

2.2 Overview of the proposed algorithm

The overall architecture of the proposed algorithm is shown in Fig. 4. All posed views are combinatorially taken and input into diverse types of networks to obtain respective color and selection towers. For each posed view, the related color towers and selection towers are fused together. Towers of different viewpoints are blended into a complete scene color tower and a scene selection tower. The blended scene towers are used to synthesize high-quality novel views by the soft-projection method with photometric consistency and correct occlusions. A denoising network is used to improve the quality of the final synthesized views. As a result, a sequence of dense virtual viewpoints can be generated with consistently high quality.

Fig. 4. The overall architecture of the proposed algorithm. (a) Posed views are combinatorially input into unsupervised CNNs to predict respective towers, and towers of the same viewpoint are fused together. (b) All posed-view towers are blended as a scene color tower and a scene selection tower. (c) Blended scene towers are soft-projected to synthesize virtual views with correct occlusions. (d) A denoising network is used to improve the image quality of final synthetic views.

2.3 Multiple unsupervised networks

Since a single unsupervised network cannot handle the occluded regions of the background, multiple unsupervised networks are adopted in this paper. As shown in Fig. 5, the structures of these networks are similar to our previous works [16,17], but the numbers of input views differ. All posed views are combinatorially taken and input into the corresponding network. Image features are concatenated and homography-warped to the target view at different depth planes. A 2D color network is applied to predict a color map, and a 2D+3D selection network is applied to predict a selection map. These maps can be stacked as a color tower and a selection tower. Note that, if there are 5 input views, $5 - 1 = 4$ network types are adopted. These networks have the same structure except for their inputs, which are 5 views, 4 views, 3 views and 2 views, respectively. Considering the combinations of the remaining input views, there are $C_4^4 + C_4^3 + C_4^2 + C_4^1 = 1 + 4 + 6 + 4 = 15$ color towers and $15$ selection towers for each posed view.
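The tower count per posed view follows directly from the combinatorics. A minimal check, under the assumption implied by the combination count above that each $n$-input network takes the posed view plus $n-1$ of the remaining $N-1$ views:

```python
from math import comb

N = 5  # number of posed views
# For each network type with n inputs, choose the other n-1 views from the remaining N-1.
towers_per_view = sum(comb(N - 1, n - 1) for n in range(2, N + 1))
print(towers_per_view)  # 4 + 6 + 4 + 1 = 15
```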

Fig. 5. The structure of the unsupervised CNN for view synthesis. Posed views are combinatorially input into the network. At different depth planes, image features are homography-warped to the target view. A 2D color network is applied to predict the color distribution, and a 2D+3D selection network is applied to predict the selection map.

2.4 Related towers fusion

For each posed view, the related towers generated by the corresponding networks need to be fused into a single color tower and a single selection tower. Along the depth dimension, the probability distribution of the selection tower is usually concentrated in one peak, for example in foreground regions. But for false background predictions caused by foreground occlusion, the probability distribution is scattered or flattened. For this reason, related towers of the same view are fused to restore the false parts. For example, with 5 input views there are 15 related selection maps, and probability errors appear on the background maps, as shown in Fig. 6. After fusion, we can see that such false regions of the background are mostly filled and revised. In practice, instead of fusing the related color towers, we directly multiply the original color image by the fused selection map and take the outcome as the fused color map, to reduce pixel errors introduced by the CNN prediction.

Fig. 6. Fusion of multiple selection maps. After fusion, such probability errors of background are mostly filled and revised. And the fused color map is obtained by multiplying the original color image and the fused selection map.

For each depth plane, the fused selection result $s_t^p$ is computed by the following equation,

$$s_t^p = \frac{{\sum\limits_{n = 2}^N {\sum\limits_{c = 1}^{n,Comb} {s_{n,\,t \in c}^p \times \alpha _{n,\,t \in c}^p \times {w_n}} } }}{{\sum\limits_{n = 2}^N {\sum\limits_{c = 1}^{n,Comb} {\alpha _{n,\,t \in c}^p \times {w_n}} } }},$$
where $p$ is the depth plane index, $t$ is the posed view index, $N$ is the number of all posed views, and $n$ is the current input-view number. $Comb$ is the number of input-view combinations under the current input-view number $n$, and $c$ is the current combination including index $t$. $s_{n,\,t \in c}^p$ is the selection map of the $t$-th selection tower at plane $p$. $\alpha _{n,\,t \in c}^p$ is the weight of the different combination views, which is inversely proportional to the square of the distance $L$ between cameras, $\alpha _{n,\,t \in c}^p = s_{n,\,t \in c}^p/{L^2}$. ${w_n}$ is the weight of network type $n$.
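A minimal numpy sketch of Eq. (1) is given below, under the assumption that each entry of sel_maps is the selection map $s_{n,\,t \in c}^p$ of one combination, with its per-pixel weight $\alpha$ and a scalar network weight $w_n$; it is an illustration, not the authors' code.

```python
import numpy as np

def fuse_selection(sel_maps, alphas, net_weights, eps=1e-8):
    """Weighted fusion of the related selection maps of one posed view at one depth plane."""
    num = np.zeros_like(sel_maps[0])
    den = np.zeros_like(sel_maps[0])
    for s, a, w in zip(sel_maps, alphas, net_weights):
        num += s * a * w   # numerator of Eq. (1)
        den += a * w       # denominator of Eq. (1)
    return num / (den + eps)
```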

2.5 Scene tower blending

The fused towers of all posed views are back-projected into the global coordinate space, where they can be blended into two complete scene towers, a scene color tower and a scene selection tower, indicating the color and probability distributions of the 3D scene along the depth. An example of the scene blending result is shown in Fig. 7.

Fig. 7. The blended scene maps. After blending, complete scene color distributions and selection distributions can be obtained.

For each plane, the scene selection tower is blended as the following expression,

$${s^p} = \frac{{\sum\limits_{t = 1}^N {s_t^p \times s_t^p} }}{{\sum\limits_{t = 1}^N {s_t^p} }}, $$
where $p$ is the depth plane index, ${s^p}$ is the blended scene selection plane, $s_t^p$ is the selection plane of each fused selection tower. The scene color tower is blended as the following expression,
$${c^p} = \frac{{\sum\limits_{t = 1}^N {c_t^p \times s_t^p} }}{{\sum\limits_{t = 1}^N {s_t^p} }}, $$
where $p$ is the plane index, ${c^p}$ is the blended scene color plane, $c_t^p$ is the color plane of each fused color tower, and $s_t^p$ is the selection map of each fused selection tower.
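A minimal numpy sketch of Eqs. (2) and (3), assuming all fused towers have already been back-projected into the global coordinate space and stacked per plane:

```python
import numpy as np

def blend_scene_plane(color_planes, sel_planes, eps=1e-8):
    """Blend the fused planes of all posed views into one scene color/selection plane."""
    sel = np.stack(sel_planes)        # (N, H, W)
    col = np.stack(color_planes)      # (N, H, W, 3)
    den = sel.sum(axis=0) + eps
    scene_sel = (sel * sel).sum(axis=0) / den                          # Eq. (2)
    scene_color = (col * sel[..., None]).sum(axis=0) / den[..., None]  # Eq. (3)
    return scene_color, scene_sel
```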

2.6 Soft-projection of tower

Since a CNN can hardly deal with the projection from 3D objects to a 2D image, layered images are usually processed by non-learning methods to synthesize novel views. Soft3D was proposed to synthesize views by soft 3D reconstruction [18], but photo-consistency was not taken into consideration. In our algorithm, soft-projection of the towers is proposed to synthesize arbitrary viewpoints from the blended scene towers.

In soft-projection, the respective planes of the scene color tower and the selection tower are projected to the target position by homography warping. A virtual view is obtained by the weighted summation of these projected planes, as in the following expression,

$${I_x} = \frac{{\sum\nolimits_p {Prj_x^p({c^p} \times {s^p} \times msk_x^p)} }}{{\sum\nolimits_p {Prj_x^p({s^p} \times msk_x^p)} }},\quad msk_x^p = occ_x^p \times con_{x,N}^p, $$
where ${I_x}$ is the color image of the virtual view, $\textrm{Prj}_x^p$ is the homography warping operation of each plane to the virtual position, and $msk_x^p$ is the threshold mask of each plane. $occ_x^p$ is the foreground occlusion mask of view ${I_x}$, and $con_{x,N}^p$ is the threshold map of photometric consistency between ${I_x}$ and all posed views. Also, the depth image ${D_x}$ can be computed as the following expression,
$${D_x} = \frac{{\sum\nolimits_p {Prj _x^p(p \times {s^p} \times msk_x^p)} }}{{\sum\nolimits_p {Prj _x^p({s^p} \times msk_x^p)} }} \times \Delta z + {z_0}, $$
where $\Delta z$ is the depth space between two planes, and ${z_0}$ is the initial depth of the first plane.
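A minimal sketch of Eqs. (4) and (5) is given below. The homography warp $Prj$ is passed in as a callable (for instance built on cv2.warpPerspective); the exact warping implementation is an assumption, not part of the paper.

```python
import numpy as np

def soft_project(scene_color, scene_sel, masks, warp, depths, eps=1e-8):
    """scene_color: (P,H,W,3); scene_sel, masks: (P,H,W); warp(img, p) warps plane p
    to the virtual position; depths[p] = z0 + p * dz."""
    num_c, num_d, den = 0.0, 0.0, 0.0
    for p in range(scene_sel.shape[0]):
        w = warp(scene_sel[p] * masks[p], p)                                      # Prj(s * msk)
        num_c = num_c + warp(scene_color[p] * (scene_sel[p] * masks[p])[..., None], p)
        num_d = num_d + depths[p] * w                                             # scalar depth per plane
        den = den + w
    color = num_c / (den[..., None] + eps)     # Eq. (4)
    depth = num_d / (den + eps)                # Eq. (5), with depths[p] = z0 + p * dz
    return color, depth
```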

The foreground occlusion mask $occ_x^p$ is used to block out background planes. For the blended scene selection tower, there can be more than one peak of the probability distribution along the depth. In the projection operation, the front peak should occlude the back scene, especially the back peaks. Due to the concentration of the probability distribution, it is not difficult to differentiate between them by the accumulated selection value of the planes in front of the current plane. A threshold $Th{d_{occ}}$ is used to check whether a front scene occludes the current plane: $occ_x^p$ is $1$ if $\sum {s^p} \le Th{d_{occ}}$, and $0$ if $\sum {s^p} > Th{d_{occ}}$. The difference between the results with and without occlusion is shown in Fig. 8. It can be seen that there are obvious ghost artifacts on the result without occlusion, while the one with occlusion is correct.
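A minimal sketch of this occlusion test, interpreting the description above as accumulating the selection values of the planes in front of each plane and comparing with the threshold:

```python
import numpy as np

def occlusion_masks(scene_sel, thd_occ=0.67):
    """scene_sel: (P, H, W) blended selection tower, ordered front to back."""
    front_sum = np.cumsum(scene_sel, axis=0) - scene_sel   # accumulated selection in front of each plane
    return (front_sum <= thd_occ).astype(scene_sel.dtype)  # 1 = visible, 0 = occluded
```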

Fig. 8. (a) Two blended scene planes. (b) The selection distribution at the same position. (c) Synthesized results with and without occlusion. There are obvious ghost artifacts on the result without occlusion, and the other one is correct.

The photometric consistency map $con_{x,N}^p$ is used to maintain the pixel consistency between the scene planes and all posed views in the weighted summation. In order to avoid correct pixels being removed by mistake, the posed-view occlusion mask $occ_t^p$ of each posed view is also considered. Each color plane ${c^p}$ and selection map ${s^p}$ are projected to the posed view ${I_t}$ to calculate the absolute pixel error $err_t^p$ with the front occlusion mask $occ_t^p$, as in the following expression,

$$err_t^p = occ_t^p \times {\left|\left|{\frac{{({I_t} - Prj_t^p({c^p})) \times Prj_t^p({s^p})}}{{Prj_t^p({s^p})}}} \right|\right|_1}, $$
where $t$ is the posed view index, and $\textrm{Prj}_t^p$ is the projection operation of each plane to the target posed position. A threshold $Th{d_{con}}$ is used to check whether the color pixels are consistent with view ${I_t}$: $con_t^p$ is 1 if $err_t^p \le Th{d_{con}}$, and is 0 if $err_t^p > Th{d_{con}}$. The overall consistency map $con_{x,N}^p$ is computed by projecting each posed-view consistency map $con_t^p$ back to the virtual view ${I_x}$ and removing all outlier parts, as in the following expression,
$$con_{x,N}^p = \coprod\limits_{t = 1}^N {Prj_x^p(con_t^p)}. $$
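A minimal sketch of Eqs. (6) and (7), treating the warps as callables and interpreting the $L_1$ norm as a per-pixel norm over the color channels and the coproduct in Eq. (7) as the intersection (product) of the back-projected per-view maps; these interpretations are assumptions.

```python
import numpy as np

def consistency_map(posed_views, color_p, sel_p, occ_t_p,
                    warp_to_posed, warp_to_virtual, thd_con=0.35, eps=1e-8):
    """All inputs refer to one depth plane p; warp_to_posed(img, t) / warp_to_virtual(img, t)
    warp between the plane and posed view t."""
    con = 1.0
    for t, I_t in enumerate(posed_views):
        c_w = warp_to_posed(color_p, t)
        s_w = warp_to_posed(sel_p, t)[..., None]
        err = occ_t_p[t] * np.abs((I_t - c_w) * s_w / (s_w + eps)).sum(axis=-1)  # Eq. (6)
        con_t = (err <= thd_con).astype(np.float32)
        con = con * warp_to_virtual(con_t, t)  # Eq. (7): keep only pixels consistent with every posed view
    return con
```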
For example, as shown in Fig. 9, there are different results on the blended color planes. When the photometric consistency map $con_{x,N}^p$ and the occlusion mask $occ_x^p$ are not applied in soft-projection, false pixels are added together on the synthesized result, causing ghost artifacts, as shown in Fig. 9 (a). When there are no occlusion masks $occ_t^p$ of the posed views in the consistency computation, the background planes are mistakenly removed, as shown in Fig. 9 (b). When the consistency map $con_{x,N}^p$ and the occlusion masks $occ_x^p$ are both considered, the synthesized result is correct, except that there are some small black holes on object edges.

Fig. 9. Different blended color results under different conditions. (a) Neither photometric consistency maps nor occlusion masks are applied. (b) Photometric consistency maps are applied without occlusion masks. (c) Both photometric consistency maps and occlusion masks are applied.

2.7 Black hole filling

Small black holes are almost inevitable in the color result of views generated by the proposed non-learning soft-projection method. There are many hole-filling approaches based on traditional methods and deep learning methods, mainly used for view synthesis such as depth-image-based rendering (DIBR) [19], and for image inpainting [20]. In consideration of the quality continuity between different viewpoints, there are two steps in the proposed filling method. A traditional filling method is used to complete the images first, and a denoising network is used to refine the final color image.

In the traditional method, rather than directly filling the color image, we complete the depth image and pick color pixels from the scene color tower at the corresponding depth planes. Black holes are identified by checking the summation of all projected selection maps, where a threshold $Th{d_{hole}}$ is applied to check whether the summed value is high enough. As shown in Fig. 10, these black holes can be gradually filled by expanding the surrounding areas from outside to inside until each hole is completed. However, this step causes depth errors on object edges, because both the foreground area and the background area are expanded at the same time, whereas black holes usually correspond to occluded background regions. For this reason, the expansion needs to be executed the same number of additional times to replace the foreground depth with the background depth. After the depth image is filled, new pixels are picked from the scene color tower and the color image is updated; a sketch of this procedure is given below.
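A minimal sketch of the depth-hole filling step, using scipy's grey_dilation as the "expanding surrounding areas" operation; the paper does not name a specific operator, and a front-small/back-large depth convention is assumed, so this is only an illustration.

```python
import numpy as np
from scipy.ndimage import grey_dilation

def fill_depth_holes(depth, sel_sum, thd_hole=0.20, iters=8):
    """depth, sel_sum: (H, W). Holes are pixels whose summed projected selection is low."""
    hole = sel_sum < thd_hole
    filled = depth.copy()
    filled[hole] = 0.0
    for _ in range(iters):                        # expand surrounding areas from outside to inside
        grown = grey_dilation(filled, size=(3, 3))
        filled = np.where((filled == 0.0) & hole, grown, filled)
    # Further dilations over the hole region would favor the (larger) background depth,
    # replacing foreground depth on object edges as described in the text.
    return filled
```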

Fig. 10. Steps of black hole filling. Black holes are identified by applying a threshold on the summation of the projected selection maps. They are filled by expanding the surrounding areas several times until each hole is completed. Moreover, the expansion is executed the same number of additional times to replace the foreground depth with the background depth. After the depth image is filled, new pixels are picked from the color tower to update the color image.

At last, a denoising network based on a GAN structure is used to refine the color result, as shown in Fig. 11. A similar architecture works very well for filling missing parts [21]. This network is trained in an unsupervised manner by reconstructing posed views with a pixel-wise loss as well as an adversarial loss. The final synthesized view can be predicted by this network with high quality. In order to prevent overfitting to the recovered input views, and to keep quality continuity between different views, images are trained in cropped sizes.
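A minimal TensorFlow sketch of the training objective described above; the paper only states a pixel-wise loss plus an adversarial loss, so the L1 form and the weighting factor lambda_adv are assumptions.

```python
import tensorflow as tf

def generator_loss(reconstructed, target, disc_fake_logits, lambda_adv=0.01):
    """Pixel-wise reconstruction loss on posed views plus a standard GAN generator loss."""
    pixel_loss = tf.reduce_mean(tf.abs(reconstructed - target))   # pixel-wise (L1) term
    adv_loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.ones_like(disc_fake_logits), logits=disc_fake_logits))
    return pixel_loss + lambda_adv * adv_loss
```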

Fig. 11. The denoising network. A GAN structure is used to refine the color result. It is trained by reconstructing posed views, and the final virtual view can be predicted with high quality.

3. Implementation and analysis

3.1 Method implementation

Three image arrays are used in the proposed algorithm, including multiple views from the Middlebury dataset (Drumsticks), multiple views of a virtual scene (Island) rendered by ourselves in 3ds Max, and multiple views of a real-scene dataset (University) [22]. Structure from motion (SFM) is used to calibrate the camera positions. An array of 5 posed views is used to synthesize virtual views. More than 4000 image arrays with 960×540 resolution are collected, and patches of 96×96 are randomly cropped for training. 64 depth planes are used for constructing each tower. The network structure follows our previous works [16,17]. Since there are 5 input views, 4 networks are designed with inputs of 5 views, 4 views, 3 views and 2 views, and 15 selection towers are predicted and fused together for each view. These networks are programmed with TensorFlow and run on two NVIDIA Quadro P6000 GPUs. In training, each iteration with 2 batches takes about 3.7 s. In practice, the proposed algorithm cannot run very fast, but it brings a very good result: about 3 min for saving the view-tower planes, about 4 min for fusion and blending, 30 s for soft-projection and hole filling of each novel view, and 100 ms for image denoising of each view.

The proposed method is suitable for many kinds of 3D light-field displays [23–26]. In our experiment, an innovative 3D light-field display is used, which is equipped with a lenticular-lens array, a 27-inch LCD panel with 4K resolution, and a holographic functional screen (HFS), as shown in Fig. 12. The lenticular-lens array is designed with a 0.7 mm lens pitch, a 9.46° slanted angle, and a 0.606 mm focal length. The 3D light-field display is able to provide 60 views within a 60° viewing angle with correct occlusion and smooth motion parallax. Virtual dense views are generated to overcome the problem of data collection and are displayed on the 3D device. These views provide a relatively good display quality.

The peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are calculated to evaluate the synthesized results. In general, when SSIM is higher than 0.95 or PSNR is higher than 30 dB, the image quality is satisfactory. When SSIM is lower than 0.9 or PSNR is lower than 20 dB, the image quality is not acceptable.
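In practice these metrics can be computed, for example, with scikit-image (the paper does not state which implementation was used):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(synth, ground_truth):
    """Both images are uint8 arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(ground_truth, synth, data_range=255)
    ssim = structural_similarity(ground_truth, synth, channel_axis=-1, data_range=255)
    return psnr, ssim
```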

3.2 Analysis of thresholds

In the synthesis method, there are three main thresholds, including the consistency threshold $Th{d_{con}}$, the occlusion threshold $Th{d_{occ}}$, and the black hole threshold $Th{d_{hole}}$. Simulations with various threshold values and various disparities are carried out. Several 5-view image arrays of the virtual scene with 960×540 resolution are utilized for the synthesis analysis, and the horizontal disparity ranges vary from 68 pixels to 200 pixels (7%, 12%, 15%, 18%, 21% of the image width). Virtual views are synthesized without denoising under different $Th{d_{con}}$, $Th{d_{occ}}$, and $Th{d_{hole}}$, and PSNR is used to estimate their quality. As shown in Fig. 13, averages over all views under the various disparity ranges are calculated and plotted, and their overall average lines are also considered. The optimum range of the average line is marked where the results are very close to the maximum, within about 0.1 dB. We can see that, although the quality declines as the disparity increases, each threshold has a similar convex value distribution under the various disparity ranges, which increases at first and then decreases. For the consistency threshold $Th{d_{con}}$, the optimum range is around [0.30, 0.40]: a higher threshold $Th{d_{con}}$ causes ghost artifacts, and a lower one creates large black holes that are not easy to fill. For the occlusion threshold $Th{d_{occ}}$, the optimum range is around [0.5, 0.85], because the selection tower of the blended scene is concentrated into only a few peaks along the depth, whose occlusions are large enough to be identified. For the hole threshold $Th{d_{hole}}$, the optimum range is around [0.05, 0.35], because there are only a few pixels with low probabilities, and a higher $Th{d_{hole}}$ would remove more pixels and create large black holes. In the following experiments, $Th{d_{con}}$ is set to 0.35, $Th{d_{occ}}$ to 0.67, and $Th{d_{hole}}$ to 0.20.

Fig. 12. The 3D light-field display used in our experiments, which supports 60 views within a 60° viewing angle.

Fig. 13. Simulations with various threshold values under various disparity ranges, from which the optimum ranges can be determined. The result under a disparity range of 7% of the image width is plotted in black, 12% in orange, 15% in blue, 18% in cyan, and 21% in magenta. The average of all lines is plotted as a red dashed line.

In the fusion of related view towers, the diverse networks have respective weights ${w_n}$. When 5 posed views are input, there are 4 weights $[{{w_2},\;{w_3},\;{w_4},\;{w_5}} ]$ corresponding to the CNNs with 2, 3, 4, and 5 input views. Simulations with various weights are carried out, and 75 combinations are given in Fig. 14, sorted by the average PSNR of all synthesized views. From Fig. 14, we can see that the result of $[{0.5,\;0.5,\;2,\;1} ]$ is better than the others, which means that the weight ${w_4}$ has a greater effect on the synthesized view than ${w_5}$, while ${w_2}$ and ${w_3}$ have similar, weaker effects. The reason is that the 4-input-view network can accurately detect the scene distribution without being affected by occlusions. In addition, compared with the all-view average line, the middle-view average line always drops when ${w_2}$ or ${w_3}$ is set to $0$, which means that the PSNR of the side views rises. This is due to the fact that the quality of the two side views synthesized by the 2-view and 3-view networks is not as good as that of the middle ones. Thus, ${w_2}$ and ${w_3}$ bring non-uniform effects to different virtual views.

Fig. 14. Simulations of 75 combinations of fusion weights. Results are sorted by the average PSNR of all synthesized views.

When dealing with various disparity ranges of the input posed views, the proposed algorithm performs well compared with other methods, as shown in Fig. 15. The unsupervised 5-view CNN (red lines) is the network with 5 input views. The unsupervised MPI (blue lines) is the MPI network trained with 5 posed views. Our method is shown in two variants, with and without the denoising network (green lines and black lines, respectively). From the figure, it can be seen that the quality of the synthesized images declines as the disparity range increases. The results of our methods are better than those of the other two methods, and the denoising network brings a large improvement in the final view quality, more than 3 dB. The quality of the posed views generated by our method without denoising is similar to that of MPI, but the quality across posed views is much more consistent. Under a disparity range of 144 pixels (15% of the image width), the synthesized views are better than 30 dB and the denoised views are around 33 dB. Under 200 pixels (21% of the image width), the synthesized views are about 28 dB and the denoised views are still above 30 dB. These results show that our algorithm is very effective for view synthesis under various disparities.

Fig. 15. The comparison of methods under various disparity ranges. The red line is the unsupervised 5-view CNN, i.e., the network with 5 input views. The blue line is the unsupervised MPI, i.e., the MPI network trained with 5 posed views. The black line is the proposed method without denoising. The green line is the proposed method with denoising.

4. Experiments

4.1 View synthesis

View synthesis comparison experiments are carried out on three scenes to validate the proposed method, including the Middlebury views (Drumsticks), the virtual scene views (Island), and the real scene views (University), as shown in Fig. 16. The horizontal disparity ranges are 15%, 15%, and 12% of the image width, respectively. Learning and non-learning algorithms are considered in the experiments, including the unsupervised MPI, the unsupervised 5-view CNN, Soft 3D, and our methods with and without the denoising CNN. Since there is currently no open-source code for Soft 3D, this algorithm was implemented from the paper's description, and the initial depth map of each view is computed from the fused selection tower for a fair comparison.

Fig. 16. Three different scenes in the comparison experiments. (a) Drumsticks, from the Middlebury dataset. (b) Island, a rendered virtual scene. (c) University, from a real-scene dataset.

For the Drumsticks scene, both posed views and virtual views are generated, and image details are presented in red and blue rectangles, as shown in Fig. 17. We can see that our methods have superior performance. When reconstructing posed views, the soft-projection result achieves a PSNR of 30 dB and an SSIM of 0.93. After denoising by the CNN, the PSNR is about 33 dB and the SSIM is 0.95, with less noise and smoother object edges. The results of the unsupervised MPI and the 5-view CNN are more blurred, because the background estimations are disturbed by the foreground; they are about 28 dB and 27 dB, respectively. The result of Soft 3D is also unsatisfactory because photo-consistency is lost. The comparison results of the virtual views are similar to those of the posed views. Our methods show better outcomes with clear object details, whereas the quality of the other methods decreases a lot with ambiguous details, because all-input-view networks have limited abilities to predict occluded scenes of virtual views. In addition, we note that although the denoising CNN is very useful for improving image quality, it is hardly able to reconstruct details lost in the previous soft-projection and filling parts, such as the occluded white words.

Fig. 17. Results for the Drumsticks scene. Our methods have superior performance compared with the other methods.

For the Island and University scenes, virtual views are generated, as shown in Fig. 18, with image details shown in the rectangles below. It can be seen that our methods outperform the other methods. In Fig. 18 (a), the Island scene, our soft-projection method reconstructs the scene with only a few errors; its PSNR is above 30 dB and its SSIM is about 0.94. Our denoising method refines most of them, such as completing the line beside the statue and refining the background sky, and improves the PSNR to 32 dB and the SSIM to 0.95. In the other methods' results, there are obviously blurred details and ghost artifacts. In Fig. 18 (b), the University scene, since there are pose estimation errors between different views in SFM, the result quality is relatively lower than that of the virtual scenes. Even so, we can see that our methods still produce acceptable outputs, where the words on the red wall are recovered completely and the details around the tree are clear. In contrast, there are some obvious problems in the other methods' results: for example, parts of the predicted details are missing in the unsupervised 5-view CNN and Soft 3D, and the details of the unsupervised MPI are completely blurred. The performance of our method is outstanding mainly because it can accurately estimate the overall scene distribution along depth by tower fusion and blending, and effectively synthesize virtual views by soft-projection. The final denoising process is also very useful for improving image quality.

Fig. 18. Results for the Island and University scenes. Our methods have superior performance compared with the other methods in different scenes.

4.2 Dense view synthesis

Dense virtual views can be synthesized by the proposed method. 60 virtual views are generated from 5 posed views, distributed horizontally along the baseline from the leftmost posed camera to the rightmost one. The corresponding EPIs (epipolar plane images) are computed, as shown in Fig. 19. Dense-view sequences of the different scenes are presented on our 3D light-field display device, as shown in Fig. 20. From the performance of the display, we can see that these views are synthesized with very good quality, with smooth motion parallax and correct occlusion. These experiments validate the effectiveness of our proposed method for the 3D light-field display.

Fig. 19. EPIs of different scenes.

Fig. 20. Results on the 3D light-field display. Image view sequences are generated after denoising. (see Visualization 1)

5. Conclusion

In summary, a method is proposed to synthesize virtual views for the 3D light-field display based on scene tower blending. Posed views are combinatorially input into diverse types of networks to predict respective view towers, and towers of the same viewpoint are fused together. All posed-view towers are blended into scene towers, and the blended scene towers are soft-projected to synthesize virtual views. A denoising network is used to improve the image quality of the final synthetic views. Experimental results validate the performance of the proposed method: the PSNR of the virtual views is about 30 dB and the SSIM is above 0.91. Since the complete 3D scene distributions at different depth planes can be accurately estimated, our method can generate high-quality dense views under various disparities. We think our method can also be applied elsewhere. For example, it can be used in the collection of dense views, saving labor and data storage, and in multi-view encoding for remote 3D display, saving transmission bandwidth.

Funding

National Natural Science Foundation of China (61905017, 61905020, 62075016).

Disclosures

The authors declare no conflicts of interest. This work is original and has not been published elsewhere.

References

1. X. Sang, X. Gao, and X. Yu, “Interactive floating full-parallax digital three-dimensional light-field display based on wavefront recomposing,” Opt. Express 26(7), 8883–8889 (2018). [CrossRef]  

2. X. Yu, X. Sang, and S. Xing, “Natural three-dimensional display with smooth motion parallax using active partially pixelated masks,” Opt. Commun. 313, 146–151 (2014). [CrossRef]  

3. R. Ng, M. Levoy, and M. Brédif, “Light field photography with a hand-held plenoptic camera,” Stanford Tech. Report 2(11), 1–11 (2005).

4. B. Wilburn, N. Joshi, and V. Vaish, “High performance imaging using large camera arrays,” ACM Trans. Graph. 24(3), 765–776 (2005). [CrossRef]  

5. S. Chan, H. Shum, and K. Ng, “Image-Based Rendering and Synthesis,” IEEE Signal Process. Mag. 24(6), 22–33 (2007). [CrossRef]  

6. G. Wu, M. Zhao, and L. Wang, “Light Field Reconstruction Using Deep Convolutional Network on EPI,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), pp. 6319–6327.

7. D. Ji, J. Kwon, and M. Mcfarland, “Deep View Morphing,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), pp. 7092–7100.

8. T. Zhou, S. Tulsiani, and W. Sun, “View Synthesis by Appearance Flow,” in Proceedings of European Conference on Computer Vision, (Springer, 2016), pp. 286–301.

9. B. Mildenhall, P. Srinivasan, and M. Tancik, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” in Proceedings of European Conference on Computer Vision, (Springer, 2020).

10. J. Flynn, I. Neulander, and J. Philbin, “Deep Stereo: Learning to Predict New Views from the World's Imagery,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2016), pp. 5515–5524.

11. Z. Xu, S. Bi, and K. Sunkavalli, “Deep view synthesis from sparse photometric images,” ACM Trans. Graph. 38(4), 1–13 (2019). [CrossRef]  

12. T. Zhou, R. Tucker, and J. Flynn, “Stereo Magnification: Learning View Synthesis using Multiplane Images,” ACM Trans. Graph. 37(4), 1–12 (2018). [CrossRef]  

13. P. Srinivasan, R. Tucker, and J. Barron, “Pushing the Boundaries of View Extrapolation With Multiplane Images,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 175–184.

14. J. Flynn, M. Broxton, and P. Debevec, “DeepView: View Synthesis With Learned Gradient Descent,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 2362–2371.

15. B. Mildenhall, P. Srinivasan, and R. Ortiz-Cayon, “Local light field fusion: practical view synthesis with prescriptive sampling guidelines,” ACM Trans. Graph. 38(4), 1–14 (2019). [CrossRef]

16. D. Chen, X. Sang, and P. Wang, “Multi-parallax views synthesis for three-dimensional light-field display using unsupervised CNN,” Opt. Express 26(21), 27585–27598 (2018). [CrossRef]  

17. D. Chen, X. Sang, and P. Wang, “Dense-view synthesis for three-dimensional light-field display based on unsupervised learning,” Opt. Express 27(17), 24624–24641 (2019). [CrossRef]  

18. E. Penner and L. Zhang, “Soft 3D reconstruction for view synthesis,” ACM Trans. Graph. 36(6), 1–11 (2017). [CrossRef]  

19. N. Guo, X. Sang, and S. Xie, “Efficient Image Warping in Parallel for Multiview Three-Dimensional Displays,” J. Disp. Technol. 12(11), 1335–1343 (2016). [CrossRef]  

20. D. Kim, S. Woo, and J. Lee, “Deep Video Inpainting,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 5785–5794.

21. D. Pathak, P. Krähenbühl, and J. Donahue, “Context Encoders: Feature Learning by Inpainting,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2016), pp. 2536–2544.

22. G. Chaurasia, S. Duchene, and O. Sorkine-Hornung, “Depth Synthesis and Local Warps for Plausible Image-based Navigation,” ACM Trans. Graph. 32(3), 1–12 (2013). [CrossRef]  

23. X. Yu, X. Sang, X. Gao, D. Chen, B. Liu, L. Liu, C. Gao, and P. Wang, “Dynamic three-dimensional light-field display with large viewing angle based on compound lenticular lens array and multi-projectors,” Opt. Express 27(11), 16024–16031 (2019). [CrossRef]  

24. X. Yu, X. Sang, X. Gao, B. Yan, D. Chen, B. Liu, L. Liu, C. Gao, and P. Wang, “360-degree tabletop 3D light-field display with ring-shaped viewing range based on aspheric conical lens array,” Opt. Express 27(19), 26738–26748 (2019). [CrossRef]  

25. X. Li, M. Zhao, Y. Xing, H. Zhang, L. Li, S. Kim, X. Zhou, and Q. Wang, “Designing optical 3D images encryption and reconstruction using monospectral synthetic aperture integral imaging,” Opt. Express 26(9), 11084–11099 (2018). [CrossRef]  

26. Y. Xing, Y. Xia, S. Li, H. Ren, and Q. Wang, “Annular sector elemental image array generation method for tabletop integral imaging 3D display with smooth motion parallax,” Opt. Express 28(23), 34706–34716 (2020). [CrossRef]  

Supplementary Material (1)

NameDescription
Visualization 1       Virtual view results and experiments of 3D light-field display
