Abstract
Three-dimensional (3D) light-field display has improved greatly in recent years. However, the collection of dense viewpoints of a real 3D scene remains a bottleneck. Virtual views can be generated by unsupervised networks, but the quality of different views is inconsistent because the networks are trained separately on each posed view. Here, a virtual view synthesis method for the 3D light-field display based on scene tower blending is presented, which synthesizes high-quality virtual views with correct occlusions by blending all tower results, so that dense viewpoints with smooth motion parallax can be provided on the 3D light-field display. Posed views are combinatorially input into diverse unsupervised CNNs to predict the respective input-view towers, and towers of the same viewpoint are fused together. All posed-view towers are blended into a scene color tower and a scene selection tower, so that the 3D scene distributions at different depth planes can be accurately estimated. The blended scene towers are soft-projected to synthesize virtual views with correct occlusions. A denoising network is used to improve the image quality of the final synthesized views. Experimental results demonstrate the validity of the proposed method, which performs well under various disparities. The PSNR of the virtual views is about 30 dB and the SSIM is above 0.91. We believe that our view synthesis method will be helpful for future applications of the 3D light-field display.
© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement
1. Introduction
Recently, the three-dimensional (3D) light-field display has improved greatly and is considered one of the most promising 3D display methods. By effectively modulating light directions, the 3D light-field display can present 3D scenes with large viewing angles and dense viewpoints, so that more 3D information can be perceived [1,2]. However, the collection of dense views of a real 3D scene is still a bottleneck for wide application of the 3D light-field display.
Many approaches have been proposed to obtain multiple views. Light-field cameras and dense camera arrays, such as Lytro and the Stanford large camera array, can capture multiple views directly [3,4]. However, neither is well suited to 3D light-field collection, because of physical limitations such as short baselines between views and complex manufacturing procedures. View synthesis algorithms operating on sparse views can be used to generate multiple virtual views. Image-based rendering (IBR) was proposed to synthesize a novel view by warping nearby existing views without 3D reconstruction [5]. Within a narrow viewing angle, this method can represent 3D scenes with realistic virtual views. However, due to discontinuities in scene depth, severe distortions and blurs appear on the synthesized images when the disparity becomes large.
Deep learning methods have been applied to the IBR task by training a convolutional neural network (CNN) end-to-end to synthesize novel views. Since CNNs excel at feature extraction and reconstruction, these methods can generate synthetic views of high quality. One class interpolates dense novel views of the light field by angular super-resolution, such as EPI reconstruction [6]; however, the maximum disparity is too narrow and the scene geometry is not considered. Another estimates a correspondence map between posed views and blends the warped images into a novel view, such as DVM [7] and Appearance Flow [8], but the synthesis position is not flexible. Another optimizes an underlying continuous volumetric scene function from a sparse set of inputs to synthesize novel views, such as NeRF [9], but the network must be retrained for each scene. Other methods compute layered scene distributions along depth by plane sweeping and project them to synthesize the novel view in various ways. For example, DeepStereo normalizes the scene selection tower by softmax and outputs the result as a layer-weighted summation [10,11]. MPI outputs the result by alpha-transparency map accumulation [12–14], and LLFF blends multiple MPIs to decrease occlusion errors [15]. However, these methods need plenty of target views for supervised training, which are often difficult to acquire. The MPVN [16] and dense-view synthesis [17] methods, proposed in our previous works, can generate multiple views by unsupervised learning, but separate training on each posed view results in inconsistent quality across viewpoints.
Here, a method of virtual view synthesis for the 3D light-field display based on scene tower blending is proposed. It can synthesize high-quality virtual views with correct occlusions by blending 3D view towers, and provide dense viewpoints on the 3D light-field display with smooth motion parallax. The approach consists of four main stages. Unsupervised learning CNNs with combinatorial input views are used to predict the respective input-view towers, and towers of the same viewpoint are fused together. All posed-view towers are blended into a scene color tower and a scene selection tower. The blended scene towers are soft-projected to synthesize novel views with correct occlusions. A denoising network is used to improve the image quality of the final synthesized views. Since the 3D scene distributions at different depth planes can be accurately estimated, this method can generate high-quality virtual views. Experimental results demonstrate the validity of the proposed method. The PSNR of the virtual views is about 30 dB and the SSIM is above 0.91 under various disparities.
The schematic comparison is shown in Fig. 1. In previous view synthesis methods, all posed views are reprojected to the virtual viewpoint at different depth planes and input into a CNN to compute the view tower, i.e., the layered distributions of the target view. However, background estimates for the target view are often disturbed by foregrounds from other viewpoints, which causes severe occlusion errors. LLFF alleviates this by blending multiple MPIs, but it requires extra viewpoints in its deep learning pipeline. In the proposed method, posed views are combinatorially input into diverse types of networks. All related posed-view towers are fused and blended together into a complete blended scene tower to decrease estimation errors.
2. Proposed method
2.1 Problem of previous CNN methods
For previous all-input-view CNN methods, there are always obvious artifacts on synthesized views, caused by 3D scene occlusion errors [16,17]. In that process, all posed views and their associated features are projected and input into the color network to compute layered distributions at different depths. Since images are convolved by multiple kernels and summed into new features before the ReLU (non-negativity) in the CNN, the output color result is always very similar to the refocusing image at that depth, where the refocused regions can be identified by the selection network. As shown in Fig. 2(a), the previous methods can correctly reconstruct the foreground region, but the reconstructed background is incomplete due to front-scene occlusions, which degrades the quality of virtual viewpoints. To solve this problem, as shown in Fig. 2(b), posed views are combinatorially input into the networks to obtain the respective distributions in the proposed method. The related results are fused and blended into a complete and correct scene plane. Thus, the blended scene distributions can be used to synthesize dense views with correct occlusions and consistent quality.
In addition, when the number of input views is increased, the accuracy of the foreground improves, but the occluded parts of the background also expand. An example from the Middlebury views (Drumsticks) is shown in Fig. 3, where the foreground and background planes obtained under different numbers of input views are presented. The foreground result with 5 input views is more accurate, but the missing regions of the background are much larger than with 3 or 2 input views. After softmax normalization, focused regions are identified by a peak weight value along the depth dimension, but the occluded backgrounds become distributed without a concentrated peak, which causes ghost artifacts and inconsistent quality when novel views are predicted. Therefore, multiple networks with diverse input numbers are applied to estimate the scene distribution with an accurate foreground and a complete background.
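The role of the softmax normalization along depth can be illustrated with a small numpy sketch; the 64-plane tower depth follows the implementation in Sec. 3.1, while the toy responses themselves are invented for illustration:

```python
import numpy as np

def depth_softmax(responses):
    """Softmax-normalize selection responses along the depth axis (axis 0)."""
    e = np.exp(responses - responses.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# Toy example: 64 depth planes at a single pixel.
focused = depth_softmax(np.eye(64)[20] * 8.0)   # one strong response at plane 20
scattered = depth_softmax(np.random.default_rng(0).normal(0.0, 0.1, 64))

# A well-focused pixel keeps a single dominant peak along depth, while an
# occluded background pixel stays near-uniform; it is the latter case that
# produces ghost artifacts at synthesis time.
```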
2.2 Overview of the proposed algorithm
The overall architecture of the proposed algorithm is shown in Fig. 4. All posed views are combinatorially taken and input into diverse types of networks to obtain the respective color and selection towers. For each posed view, the related color towers and selection towers are fused together. Towers of different viewpoints are blended into a complete scene color tower and a scene selection tower. The blended scene towers are used to synthesize high-quality novel views by the soft-projection method with photometric consistency and correct occlusions. A denoising network is used to improve the quality of the final synthesized views. As a result, a sequence of dense virtual viewpoints can be generated with consistently high quality.
2.3 Multiple unsupervised networks
Since a single unsupervised network cannot handle the occluded regions of the background, multiple unsupervised networks are adopted in this paper. As shown in Fig. 5, the structures of these networks are similar to our previous work [16,17], but the numbers of input views vary. All posed views are combinatorially taken and input into the corresponding network. Image features are concatenated and homography-warped to the target view at different depth planes. A 2D color network is applied to predict a color map, and a 2D+3D selection network is applied to predict a selection map. These maps are stacked as a color tower and a selection tower. Note that if there are 5 input views, $5 - 1 = 4$ network types are adopted. These networks have the same structure except for their inputs, which are 5 views, 4 views, 3 views, and 2 views, respectively. Considering the combinations of residual inputs, there are $C_4^4 + C_4^3 + C_4^2 + C_4^1 = 1 + 4 + 6 + 4 = 15$ color towers and $15$ selection towers for each posed view.
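The combinatorial counting above can be checked with a short Python sketch, assuming (as the count $C_4^4 + C_4^3 + C_4^2 + C_4^1$ suggests) that each $k$-input network takes the target posed view plus $k-1$ of the residual views:

```python
from itertools import combinations

posed = ["v1", "v2", "v3", "v4", "v5"]        # 5 posed views
target = "v1"                                  # posed view whose towers we predict
residual = [v for v in posed if v != target]   # the other 4 views

# A k-input network takes the target view plus k-1 residual views, so the
# input sets are exactly the combinations of the residual views.
input_sets = [(target,) + c
              for k in range(1, len(residual) + 1)
              for c in combinations(residual, k)]

print(len(input_sets))  # C(4,1)+C(4,2)+C(4,3)+C(4,4) = 15 towers per posed view
```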
2.4 Related towers fusion
For each posed view, the related towers generated by the corresponding networks must be fused into a single color tower and a single selection tower. Along the depth dimension, the probability distribution of the selection tower is usually concentrated into one peak, for example in foreground regions. But for false background predictions caused by foreground occlusion, the probability distribution becomes scattered or flattened. For this reason, related towers of the same view are fused to restore the false parts. For example, with 5 input views there are 15 related selection maps. Probability errors appear on the background maps, as shown in Fig. 6. After fusion, such false regions of the background are mostly filled and revised. In practice, instead of fusing the related color towers, we directly multiply the original color image by the fused selection map and take the result as the fused color map, to reduce pixel errors introduced by CNN prediction.
For each depth plane, the fused selection result $s_t^p$ is computed by the following equation,
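As a hedged sketch of this fusion step, assuming a normalized weighted sum over the related selection maps with the per-network weights ${w_n}$ analyzed in Sec. 3.2 (the exact normalization in the paper's equation may differ):

```python
import numpy as np

def fuse_selection_plane(maps_by_netsize, weights):
    """Fuse the related selection maps of one posed view at one depth plane.
    maps_by_netsize: {n_input_views: [HxW selection maps from that network]}
    weights:         {n_input_views: scalar weight w_n}
    Hypothetical normalized weighted average of all related maps."""
    num, den = 0.0, 0.0
    for n, maps in maps_by_netsize.items():
        for m in maps:
            num = num + weights[n] * m
            den = den + weights[n]
    return num / den

rng = np.random.default_rng(1)
maps = {5: [rng.random((4, 4))],                    # C(4,4) = 1 map
        4: [rng.random((4, 4)) for _ in range(4)]}  # C(4,3) = 4 maps
fused = fuse_selection_plane(maps, {5: 1.0, 4: 2.0})
```

A weighted average keeps the fused values in the same probability range as the inputs, so concentrated peaks from accurate networks dominate scattered responses from occluded ones.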
2.5 Scene tower blending
The fused towers of all posed views are back-projected into the global coordinate space, where they are blended into two complete scene towers, a scene color tower and a scene selection tower, indicating the color and probability distributions of the 3D scene along the depth. An example of the scene blending result is shown in Fig. 7.
For each plane, the scene selection tower is blended as the following expression,
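One plausible form of this blending can be sketched as follows, under the assumption that the scene selection plane averages the back-projected per-view selections and the scene color plane is their selection-weighted mean; the paper's expression may normalize differently:

```python
import numpy as np

def blend_scene_plane(colors, selections):
    """Blend the back-projected fused towers of all posed views at one
    depth plane into a scene color plane and a scene selection plane.
    colors:     list of HxWx3 back-projected color planes (one per posed view)
    selections: list of HxW back-projected selection planes"""
    sel = np.stack(selections)                   # T x H x W
    col = np.stack(colors)                       # T x H x W x 3
    scene_sel = sel.mean(axis=0)                 # blended probability
    w = sel / np.maximum(sel.sum(axis=0), 1e-8)  # per-view blending weights
    scene_col = (w[..., None] * col).sum(axis=0)
    return scene_col, scene_sel
```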
2.6 Soft-projection of tower
Since a CNN can hardly handle the projection from 3D objects to a 2D image, layered images are usually projected by non-learning methods to synthesize novel views. Soft3D was proposed to synthesize views by soft 3D reconstruction [18], but photo-consistency was not taken into consideration. In our algorithm, soft-projection of the tower is proposed to synthesize arbitrary viewpoints from the blended scene towers.
In soft-projection, the respective planes of the scene color tower and the selection tower are projected to the target position by homography warping. A virtual view is obtained by the weighted summation of these projected planes, as the following expression,
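The weighted summation can be sketched as below, assuming per-plane weights of the form $occ \cdot con \cdot s$, combining the occlusion and consistency maps defined just after; the exact weighting in the paper's expression may differ:

```python
import numpy as np

def soft_project(plane_colors, plane_sels, occ_masks, con_maps):
    """Weighted summation of projected scene planes into a virtual view.
    All inputs are lists over depth planes, already homography-warped to
    the target position. The per-plane weight occ * con * s is a plausible
    sketch, not the paper's exact expression."""
    num, den = 0.0, 0.0
    for c, s, occ, con in zip(plane_colors, plane_sels, occ_masks, con_maps):
        w = occ * con * s                 # per-pixel weight of this plane
        num = num + w[..., None] * c
        den = den + w
    return num / np.maximum(den[..., None], 1e-8)
```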
The foreground occlusion mask $occ_x^p$ is used to block out background planes. For the blended scene selection tower, there may be more than one peak of the probability distribution along the depth. In the projection operation, the front peak should occlude the scene behind it, especially the back peaks. Owing to the concentration of the probability distribution, they are not difficult to differentiate by the accumulated selection value. A threshold $Th{d_{occ}}$ is used to determine whether a front scene occludes the current plane: $occ_x^p = 1$ if $\sum {s^p} \le Th{d_{occ}}$, and $occ_x^p = 0$ if $\sum {s^p} > Th{d_{occ}}$. The difference between results with and without the occlusion mask is shown in Fig. 8: there are obvious ghost artifacts in the result without occlusion handling, while the result with occlusion handling is correct.
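A minimal sketch of the occlusion mask computation, assuming the accumulated selection $\sum s^p$ runs over the planes in front of the current plane (front-to-back ordering) and using the threshold value 0.67 chosen in Sec. 3.2:

```python
import numpy as np

def occlusion_masks(scene_sel_tower, thd_occ=0.67):
    """Per-plane foreground occlusion masks from the blended scene
    selection tower (shape D x H x W, front plane first): a pixel on
    plane p stays visible (mask 1) while the selection accumulated over
    the planes in front of it is at most Thd_occ, and is blocked
    (mask 0) once that accumulation exceeds Thd_occ."""
    in_front = np.cumsum(scene_sel_tower, axis=0) - scene_sel_tower
    return (in_front <= thd_occ).astype(float)
```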
The photometric consistency map $con_{x,N}^p$ is used to maintain pixel consistency between the scene planes and all posed views in the weighted summation. To avoid removing correct pixels by mistake, the posed-view occlusion mask $occ_t^p$ of each posed view is also considered. Each color plane $c^p$ and selection map $s^p$ are projected to the posed view ${I_t}$ to calculate the absolute pixel error $err_t^p$ under the front occlusion mask $occ_t^p$, as the following expression,
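A hedged sketch of how a consistency decision could be made from the absolute pixel error $err_t^p$, assuming the consistency threshold $Th{d_{con}}$ from Sec. 3.2 is applied per pixel; the paper's map $con_{x,N}^p$ may instead aggregate softly over all $N$ posed views:

```python
import numpy as np

def consistency_weight(plane_at_t, posed_view, occ_t, thd_con=0.35):
    """Photometric consistency of one projected scene plane against one
    posed view I_t: the absolute per-pixel color error err_t^p is
    evaluated where the plane is not occluded in that view, and pixels
    whose error is at most Thd_con are kept. Binary per-view version."""
    err = np.abs(plane_at_t - posed_view).mean(axis=-1)   # err_t^p
    return occ_t * (err <= thd_con).astype(float)
```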
2.7 Black hole filling
Small black holes are almost inevitable in the color result of a view generated by the non-learning soft-projection method above. There are many hole-filling approaches based on traditional and deep learning methods, mainly used for view synthesis such as depth-image-based rendering (DIBR) [19] and for image inpainting [20]. In consideration of the quality continuity between different viewpoints, there are two steps in the proposed filling method. A traditional filling method is used to complete the images first, and a denoising network is used to refine the final color image.
In the traditional method, rather than directly filling the color image, we complete the depth image and pick color pixels from the scene color tower at the corresponding depth planes. Black holes are identified by checking the summation of all projected selection maps, where a threshold $Th{d_{hole}}$ is applied to test whether the summed value is high enough. As shown in Fig. 10, these black holes can be gradually filled by expanding the surrounding areas from outside to inside until each hole is completed. However, this step causes depth errors on object edges, because both foreground and background areas are expanded at the same time, whereas black holes are almost always occluded scenes in the background. For this reason, the expansion is executed for the same number of additional passes to replace the foreground depth with the background depth. After the depth image is filled, new pixels are picked from the scene color tower and the color image is updated.
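The two-step depth filling can be sketched as follows; the 4-neighbour propagation, the wrap-around border handling via `np.roll`, and the "larger depth value = farther" convention are all assumptions of this illustration, not details taken from the paper:

```python
import numpy as np

def fill_depth_holes(depth, hole_mask, n_extra=2):
    """Fill black holes in the depth image by expanding the surrounding
    areas from outside to inside (4-neighbour propagation), then run the
    same number of extra passes that replace filled pixels with the
    farthest neighbouring depth, so foreground depths leaked into the
    holes give way to the background."""
    d = depth.astype(float).copy()
    unknown = hole_mask.copy()

    def nb(a):  # stack of the four 4-neighbour shifts
        return np.stack([np.roll(a, s, ax) for ax in (0, 1) for s in (1, -1)])

    # Step 1: grow known depths into the holes until every hole is filled.
    while unknown.any():
        known_nb = nb((~unknown).astype(float))
        grow = unknown & (known_nb.sum(axis=0) > 0)
        vals = (nb(d) * known_nb).sum(axis=0) / np.maximum(known_nb.sum(axis=0), 1)
        d[grow] = vals[grow]
        unknown &= ~grow

    # Step 2: push background (far) depth over any leaked foreground depth.
    for _ in range(n_extra):
        far = np.maximum(d, nb(d).max(axis=0))
        d[hole_mask] = far[hole_mask]
    return d
```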
Finally, a denoising network based on a GAN structure is used to refine the color result, as shown in Fig. 11. A similar architecture works very well for filling missing parts [21]. This network is trained without supervision by reconstructing posed views with a pixel-wise loss as well as an adversarial loss. The final synthesized view is then predicted by this network with high quality. To prevent overfitting to the recovery of the input views, and to keep quality continuity between different views, images are trained on cropped patches.
3. Implementation and analysis
3.1 Method implementation
Three image arrays are used in the proposed algorithm: multiple views of the Middlebury dataset (Drumsticks), multiple views of a virtual scene (Island) rendered by ourselves in 3ds Max, and multiple views of a real-scene dataset (University) [22]. Structure from motion (SFM) is used to calibrate the camera positions. An array of 5 image views is used to synthesize virtual views. More than 4000 image arrays with 960×540 image resolution are collected, and 96×96 patches are randomly cropped for training. 64 depth planes are used to construct each tower. The network structure follows our previous work [16,17]. With 5 input views, 4 networks are designed with inputs of 5 views, 4 views, 3 views, and 2 views, and 15 selection towers are predicted and fused together for each view. These networks are programmed in TensorFlow and run on two NVIDIA Quadro P6000 GPUs. In training, each iteration with 2 batches costs about 3.7 s. In practice, the proposed algorithm does not run fast, but it brings a very good result: about 3 min for saving the view tower planes, about 4 min for fusion and blending, 30 s for soft-projection and hole filling on each novel view, and 100 ms for image denoising of each view.
The proposed method is suitable for many kinds of 3D light-field displays [23–26]. In our experiment, an innovative 3D light-field display is used, which is equipped with a lenticular-lens array, a 27-inch LCD panel with 4K resolution, and a holographic functional screen (HFS), as shown in Fig. 12. The lenticular-lens array is designed with a 0.7 mm lens pitch, a 9.46° slant angle, and a 0.606 mm focal length. The 3D light-field display can provide 60 views within a 60° viewing angle with correct occlusion and smooth motion parallax. Virtual dense views are generated to overcome the data-collection problem and displayed on the 3D device. These views provide relatively good display quality.
Peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) are calculated to evaluate the synthesized results. In general, when SSIM is higher than 0.95 or PSNR is higher than 30 dB, the image quality is satisfactory. When SSIM is lower than 0.9 or PSNR is lower than 20 dB, the image quality is not acceptable.
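The PSNR criterion can be made concrete with a short sketch (SSIM requires local windowed statistics and is better taken from a library such as scikit-image):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a reference view and a
    synthesized view, both in the 0-255 range."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 120.0)
noisy = ref.copy()
noisy[0, 0] += 8.0                      # one pixel off by 8 grey levels
print(round(psnr(ref, noisy), 1))       # 48.1
```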
3.2 Analysis of thresholds
In the synthesis method, there are three main thresholds: the consistency threshold $Th{d_{con}}$, the occlusion threshold $Th{d_{occ}}$, and the black hole threshold $Th{d_{hole}}$. Simulations over various threshold values and disparities are tested. Several 5-view image arrays of the virtual scene with 960×540 resolution are used for the synthesis analysis, with horizontal disparity ranges from 68 pixels to 200 pixels (7%, 12%, 15%, 18%, 21% of image width). Virtual views are synthesized without denoising under different $Th{d_{con}}$, $Th{d_{occ}}$, and $Th{d_{hole}}$, and PSNR is used to estimate their quality. As shown in Fig. 13, the averages over all views under the various disparity ranges are calculated and plotted, together with their average lines. The optimum range of the average line is marked where results are very close to the maximum, within about 0.1 dB. Although the quality declines as the disparity increases, each threshold has a similar convex value distribution under the various disparity ranges, rising at first and then falling. For the consistency threshold $Th{d_{con}}$, the optimum range is around [0.30, 0.40]: a higher $Th{d_{con}}$ causes ghost artifacts, and a lower one creates large black holes that are not easy to fill. For the occlusion threshold $Th{d_{occ}}$, the optimum range is around [0.5, 0.85], because the selection tower of the blended scene is concentrated into only a few peaks along the depth, whose occlusions are large enough to be identified. For the hole threshold $Th{d_{hole}}$, the optimum range is around [0.05, 0.35], because there are only a few pixels with low probabilities, and a higher $Th{d_{hole}}$ would remove more pixels and create large black holes. In the following experiment, $Th{d_{con}}$ is set to 0.35, $Th{d_{occ}}$ to 0.67, and $Th{d_{hole}}$ to 0.20.
In the fusion of related view towers, the diverse networks have respective weights ${w_n}$. When 5 posed views are input, there are 4 weights $[{{w_2},\;{w_3},\;{w_4},\;{w_5}} ]$ corresponding to the CNNs with 2, 3, 4, and 5 input views. Simulations over various weights are tested, and 75 combinations are given in Fig. 14, sorted by the average PSNR of all synthesized views. From Fig. 14, the result of $[{0.5,\;0.5,\;2,\;1} ]$ is better than the others, which means that the weight ${w_4}$ has a stronger effect on the synthesized view than ${w_5}$, while ${w_2}$ and ${w_3}$ have similar, weaker effects. The reason is that the 4-input-view network can accurately detect the scene distribution without being affected by occlusions. In addition, compared with the all-view average line, the middle-view average line always drops when ${w_2}$ or ${w_3}$ is set to $0$, which means the PSNR of the side views rises. This is because the quality of the two side views synthesized by the 2-view and 3-view networks is not as good as that of the middle ones. Thus, ${w_2}$ and ${w_3}$ have non-uniform effects on different virtual views.
When dealing with various disparity ranges of the input posed views, the proposed algorithm performs acceptably compared with other methods, as shown in Fig. 15. The unsupervised 5-view CNN (red lines) is the network with 5 input views. The unsupervised MPI (blue lines) is the MPI network trained with 5 posed views. Our method is shown in two variants, with and without the denoising network (green lines and black lines, respectively). From the figure, the quality of synthesized images declines as the disparity range increases. The results of our methods are better than those of the other two methods, and the denoising network brings a large improvement in the final view quality, more than 3 dB. The quality of posed views generated by our method without denoising is similar to that of MPI, but the quality across posed views is much more consistent. Under a disparity range of 144 pixels (15% of image width), the synthesized views exceed 30 dB and the denoised views are around 33 dB. Under 200 pixels (21% of image width), the synthesized views are about 28 dB and the denoised views are still above 30 dB. These results show that our algorithm is very effective for view synthesis under various disparities.
4. Experiments
4.1 View synthesis
View synthesis comparison experiments are carried out on three scenes to validate the proposed method: the Middlebury views (Drumsticks), the virtual scene views (Island), and the real scene views (University), as shown in Fig. 16. The horizontal disparity ranges are 15%, 15%, and 12% of image width, respectively. Learning and non-learning algorithms are considered in the experiments: the unsupervised MPI, the unsupervised 5-view CNN, Soft 3D, and our methods with and without the denoising CNN. Since no open-source code for Soft 3D is currently available, this algorithm is implemented from the paper's description, and the initial depth map of each view is computed from the fused selection tower for a fair comparison.
For the Drumsticks scene, both posed views and virtual views are generated, and image details are presented in red and blue rectangles, as shown in Fig. 17. Our methods show superior performance. When reconstructing posed views, the soft-projection result achieves a PSNR of 30 dB and an SSIM of 0.93. After denoising by the CNN, the PSNR is about 33 dB and the SSIM is 0.95, with less noise and smoother object edges. The results of the unsupervised MPI and the 5-view CNN are more blurred, reaching about 28 dB and 27 dB, because the background estimates are disturbed by the foreground. The result of Soft 3D is also not good, because photo-consistency is lost. The comparison results of the virtual views are similar to those of the posed views. Our methods show better outcomes with clear object details, whereas the quality of the other methods decreases considerably with ambiguous details, because all-input-view networks have limited ability to predict the occluded scenes of virtual views. In addition, we note that although the denoising CNN is very useful for improving image quality, it can hardly reconstruct details lost in the earlier soft-projection and filling steps, such as the occluded white words.
For the Island and University scenes, virtual views are generated, as shown in Fig. 18, with image details shown in the rectangles below. Our methods outperform the other methods. In the Island scene of Fig. 18(a), our soft-projection method can reconstruct the scene with only a few errors, with a PSNR above 30 dB and an SSIM of about 0.94. Our denoising method fixes most of them, for example completing the line beside the statue and refining the background sky, and improves the PSNR to 32 dB and the SSIM to 0.95. The other methods show obvious blurred details and ghost artifacts. In the University scene of Fig. 18(b), since there are pose estimation errors between different views in SFM, the result quality is relatively lower than that of the virtual scenes. Even so, our methods still produce acceptable outputs, where the words on the red wall are recovered in their complete appearance and the details around the tree are clear. The other methods show some obvious problems: parts of the predicted details are missing in the unsupervised 5-view CNN and Soft 3D, and the details of the unsupervised MPI are completely blurred. The performance of our method is outstanding mainly because it can accurately estimate the overall scene distribution along depth by tower fusion and blending, and effectively synthesize virtual views by soft-projection. The final denoising process is also very useful for improving image quality.
4.2 Dense view synthesis
Dense virtual views can be synthesized by the proposed method. 60 virtual views are generated from 5 posed views, located horizontally and in parallel from the leftmost posed camera to the rightmost one. The related EPIs (epipolar plane images) are computed, as shown in Fig. 19. Dense-view sequences of the different scenes are presented on our 3D light-field display device, as shown in Fig. 20. From the performance of the display, these views are synthesized with very good quality, smooth motion parallax, and correct occlusion. These experiments validate the effectiveness of the proposed method for the 3D light-field display.
5. Conclusion
In summary, a method is proposed to synthesize virtual views for the 3D light-field display based on scene tower blending. Posed views are combinatorially input into diverse types of networks to predict the respective view towers, and towers of the same viewpoint are fused together. All posed-view towers are blended into scene towers. The blended scene towers are soft-projected to synthesize virtual views. A denoising network is used to improve the image quality of the final synthesized views. Experimental results validate the performance of the proposed method. The PSNR of the virtual views is about 30 dB and the SSIM is above 0.91. Since the complete 3D scene distributions at different depth planes can be accurately estimated, our method can generate high-quality dense views under various disparities. We believe our method can also be applied elsewhere: for example, in the collection of dense views, saving labor and data storage, and in multi-view encoding for remote 3D display, saving transmission bandwidth.
Funding
National Natural Science Foundation of China (61905017, 61905020, 62075016).
Disclosures
The authors declare no conflicts of interest. This work is original and has not been published elsewhere.
References
1. X. Sang, X. Gao, and X. Yu, “Interactive floating full-parallax digital three-dimensional light-field display based on wavefront recomposing,” Opt. Express 26(7), 8883–8889 (2018). [CrossRef]
2. X. Yu, X. Sang, and S. Xing, “Natural three-dimensional display with smooth motion parallax using active partially pixelated masks,” Opt. Commun. 313, 146–151 (2014). [CrossRef]
3. R. Ng, M. Levoy, and M. Brédif, “Light field photography with a hand-held plenoptic camera,” Stanford Tech. Report 2(11), 1–11 (2005).
4. B. Wilburn, N. Joshi, and V. Vaish, “High performance imaging using large camera arrays,” ACM Trans. Graph. 24(3), 765–776 (2005). [CrossRef]
5. S. Chan, H. Shum, and K. Ng, “Image-Based Rendering and Synthesis,” IEEE Signal Process. Mag. 24(6), 22–33 (2007). [CrossRef]
6. G. Wu, M. Zhao, and L. Wang, “Light Field Reconstruction Using Deep Convolutional Network on EPI,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), pp. 6319–6327.
7. D. Ji, J. Kwon, and M. Mcfarland, “Deep View Morphing,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), pp. 7092–7100.
8. T. Zhou, S. Tulsiani, and W. Sun, “View Synthesis by Appearance Flow,” in Proceedings of European Conference on Computer Vision, (Springer, 2016), pp. 286–301.
9. B. Mildenhall, P. Srinivasan, and M. Tancik, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” in Proceedings of European Conference on Computer Vision, (Springer, 2020).
10. J. Flynn, I. Neulander, and J. Philbin, “Deep Stereo: Learning to Predict New Views from the World's Imagery,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2016), pp. 5515–5524.
11. Z. Xu, S. Bi, and K. Sunkavalli, “Deep view synthesis from sparse photometric images,” ACM Trans. Graph. 38(4), 1–13 (2019). [CrossRef]
12. T. Zhou, R. Tucker, and J. Flynn, “Stereo Magnification: Learning View Synthesis using Multiplane Images,” ACM Trans. Graph. 37(4), 1–12 (2018). [CrossRef]
13. P. Srinivasan, R. Tucker, and J. Barron, “Pushing the Boundaries of View Extrapolation With Multiplane Images,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 175–184.
14. J. Flynn, M. Broxton, and P. Debevec, “DeepView: View Synthesis With Learned Gradient Descent,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 2362–2371.
15. B. Mildenhall, P. Srinivasan, and R. Ortiz-Cayon, “Local light field fusion: practical view synthesis with prescriptive sampling guidelines,” ACM Trans. Graph. 38(4), 1–14 (2019). [CrossRef]
16. D. Chen, X. Sang, and P. Wang, “Multi-parallax views synthesis for three-dimensional light-field display using unsupervised CNN,” Opt. Express 26(21), 27585–27598 (2018). [CrossRef]
17. D. Chen, X. Sang, and P. Wang, “Dense-view synthesis for three-dimensional light-field display based on unsupervised learning,” Opt. Express 27(17), 24624–24641 (2019). [CrossRef]
18. E. Penner and L. Zhang, “Soft 3D reconstruction for view synthesis,” ACM Trans. Graph. 36(6), 1–11 (2017). [CrossRef]
19. N. Guo, X. Sang, and S. Xie, “Efficient Image Warping in Parallel for Multiview Three-Dimensional Displays,” J. Disp. Technol. 12(11), 1335–1343 (2016). [CrossRef]
20. D. Kim, S. Woo, and J. Lee, “Deep Video Inpainting,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 5785–5794.
21. D. Pathak, P. Krähenbühl, and J. Donahue, “Context Encoders: Feature Learning by Inpainting,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2016), pp. 2536–2544.
22. G. Chaurasia, S. Duchene, and O. Sorkine-Hornung, “Depth Synthesis and Local Warps for Plausible Image-based Navigation,” ACM Trans. Graph. 32(3), 1–12 (2013). [CrossRef]
23. X. Yu, X. Sang, X. Gao, D. Chen, B. Liu, L. Liu, C. Gao, and P. Wang, “Dynamic three-dimensional light-field display with large viewing angle based on compound lenticular lens array and multi-projectors,” Opt. Express 27(11), 16024–16031 (2019). [CrossRef]
24. X. Yu, X. Sang, X. Gao, B. Yan, D. Chen, B. Liu, L. Liu, C. Gao, and P. Wang, “360-degree tabletop 3D light-field display with ring-shaped viewing range based on aspheric conical lens array,” Opt. Express 27(19), 26738–26748 (2019). [CrossRef]
25. X. Li, M. Zhao, Y. Xing, H. Zhang, L. Li, S. Kim, X. Zhou, and Q. Wang, “Designing optical 3D images encryption and reconstruction using monospectral synthetic aperture integral imaging,” Opt. Express 26(9), 11084–11099 (2018). [CrossRef]
26. Y. Xing, Y. Xia, S. Li, H. Ren, and Q. Wang, “Annular sector elemental image array generation method for tabletop integral imaging 3D display with smooth motion parallax,” Opt. Express 28(23), 34706–34716 (2020). [CrossRef]