
DGE-CNN: 2D-to-3D holographic display based on a depth gradient extracting module and ZCNN network


Abstract

Holography is a crucial technique for the ultimate three-dimensional (3D) display, because it renders all the optical cues required by the human visual system. However, the shortage of 3D content severely restricts the widespread application of holographic 3D displays. In this paper, a 2D-to-3D holographic display system based on deep-learning monocular depth estimation is proposed. By feeding a single RGB image of a 3D scene into our designed DGE-CNN network, a corresponding display-oriented 3D depth map can be accurately generated for layer-based computer-generated holography. With simple parameter adjustment, our system can adapt the distance range of the holographic display to specific requirements. High-quality and flexible holographic 3D display can thus be achieved from a single RGB image without 3D rendering devices, permitting potential human-display interactive applications such as remote education, navigation, and medical treatment.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Holography encodes a three-dimensional (3D) wavefront as an interference pattern carrying variations in both phase and amplitude. The intensity and depth cues can be reconstructed from the interference fringes under illumination by the reference light [1,2]. Since holography can render all the optical cues required by the human visual system (HVS), it is regarded as a crucial route toward the ultimate 3D display [3,4]. It has been attractive for many virtual reality (VR) and augmented reality (AR) applications, such as remote education [5], spatial cognition and navigation [6], and medical treatment [7,8]. Computer-generated holograms (CGHs) are calculated from various 3D representations of the scene, such as point clouds, depth maps, or light-field data [9].

Computing high-quality CGHs requires accurate 3D information. Currently, the difficulty of accessing 3D content has greatly restricted the applications and commercialization of holographic 3D displays. Acquisition technologies such as time-of-flight (TOF) [10], light-field [11], and structured-illumination [12] imaging are efficient ways to acquire 3D depth cues. However, these 3D acquisition devices may bring challenges such as high cost, insufficient resolution, and high system complexity. Recently, an end-to-end CGH generation network has been proposed that bypasses the lack of 3D content and predicts a 3D CGH directly from a 2D input [13]. However, such a network can only generate CGHs for a fixed distance range of optical reconstruction and must be retrained with a different set of parameters for any other range, which restricts its use in human-display interactive scenarios, for example when a user wants to stretch the 3D volume and change its thickness. The requirement for 3D CGHs with both quality and flexibility brings us back to the problem of 3D acquisition and its combination with holographic display.

2D-to-3D rendering approaches are promising options for enriching 3D content for holographic display. In computer vision, features of 2D images such as texture variation, color, haze, motion, and defocus can be exploited to retrieve 3D information [8,14–17]. Existing work that combines 2D-to-3D rendering with holographic 3D display follows the same strategy, extracting such features to predict depth maps from input 2D images. However, this approach only works well for simple scenes with obvious depth cues or for computer-modeled objects without a complex background. Predicting depth maps for scenes with varied depth cues or complicated backgrounds remains a major challenge. Recently, with the development of machine learning and neural networks, learning-based 2D-to-3D rendering approaches have become an effective way to handle monocular depth estimation (MDE), enabling the enrichment of 3D content for holographic display.

Classic learning-based MDE methods have focused on hand-crafted features that contain depth cues. With some assumptions on the scene geometry, probabilistic models are trained and used to infer the third dimension [15,18–20]. As deep convolutional neural networks (CNNs) continue to provide excellent results in computer vision tasks, they also reach high overall accuracy for the MDE task by modeling it directly as a regression or classification problem [21–26]. With proper tuning toward a better match with the HVS, CNN-based MDE is a promising route to effective 3D content generation and holographic display of scenes captured with a mobile phone camera.

In this paper, we propose a 2D-to-3D holographic display system using a deep-learning MDE method. By implementing a depth gradient extracting (DGE) module in the CNN, named DGE-CNN, a high-quality display-oriented depth map can be generated from a single RGB image. The obtained depth map is then used in a layer-oriented angular-spectrum algorithm to calculate the corresponding CGH. The resulting 3D display shows prominent depth variation. The DGE-CNN provides an efficient, effective, and high-quality 3D reconstruction system with great simplicity, flexibility, and generalization ability.

2. Generation of depth maps

The structure of the proposed DGE-CNN is shown in Fig. 1(a). The input is a single RGB 2D image. A classic ResNet-50 network with its final pooling and fully-connected layers removed serves as our encoder. The encoding section achieves a receptive field of 427 × 427, which covers the input image (304 × 228) and can accommodate even higher resolutions. After the initial CNN downsampling, the residual downsample block (B1) and residual projection block (B2) allow the network to go deeper without degradation or vanishing-gradient problems, thus increasing the representational capacity of the model [23,27]. Following the same strategy, a series of up-sampling residual blocks is used for restoration, in which the Up-resconv block applies similar shortcut connections to improve the prediction and generalization performance of the network. In the network, the encoder detects and captures features carrying the comprehensive 3D information of the input image, while the decoder acts as a processor that generates the corresponding depth map from these features. Detailed structures of all blocks are depicted in Figs. 1(b)-1(f).


Fig. 1. (a) Structure of the DGE-CNN. The proposed structure is composed of a classic Resnet-50 network and up-sampling blocks. A depth gradient extracting module is implemented in the network during the back propagation. (b) CNN downsample block. (c) B1: residual downsample block. (d) B2: residual projection block. (e) CNN up-sampling block. (f) Up-Resconv Block: residual up-sampling block.

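To make the encoder-decoder layout concrete, the following PyTorch sketch mirrors the description above: a ResNet-50 backbone with its final pooling and fully-connected layers removed serves as the encoder, and a chain of up-sampling residual blocks restores a single-channel depth map. The block definitions (UpResconv, DepthNet) are simplified stand-ins inferred from Fig. 1, not the authors' released implementation.

import torch
import torch.nn as nn
import torchvision

class UpResconv(nn.Module):
    """Simplified up-sampling residual block: bilinear upsample, 1x1 projection,
    then a residual convolutional refinement with a shortcut connection."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.body = nn.Sequential(
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.relu = nn.ReLU()

    def forward(self, x):
        x = nn.functional.interpolate(x, scale_factor=2, mode="bilinear",
                                      align_corners=False)
        x = self.proj(x)
        return self.relu(self.body(x) + x)      # residual shortcut

class DepthNet(nn.Module):
    """Encoder-decoder depth estimator in the spirit of DGE-CNN."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Keep only the fully-convolutional part of ResNet-50 as the encoder.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.decoder = nn.Sequential(
            UpResconv(2048, 1024), UpResconv(1024, 512),
            UpResconv(512, 256), UpResconv(256, 128),
            nn.Conv2d(128, 1, 3, padding=1))    # single-channel depth map

    def forward(self, rgb):                     # rgb: (N, 3, H, W)
        return self.decoder(self.encoder(rgb))

# A 304 x 228 input yields a 160 x 128 depth prediction, matching the sizes quoted below.
depth = DepthNet()(torch.randn(1, 3, 304, 228))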

When high-quality depth maps are required for display, the special characteristics of the HVS need extra attention. It is well established that human eyes are sensitive to scene boundaries, which commonly appear as high-contrast pixel regions in images [28,29]. Depth variation in the input image is a major factor producing such contrast and should be specifically considered for 3D reconstruction. In this work, we designed a DGE module that generates a depth gradient map from the obtained depth map to represent depth variation, and implemented it in the proposed network as training guidance. The DGE module consists of two branched Sobel convolution operators with untrainable constant parameters, as shown in Fig. 2. The gradients along the x and y axes are extracted by the two operators separately and then integrated into a complete depth gradient map, through which the training loss can be back-propagated.


Fig. 2. Sobel operators for edge detection. The $G_x$ operator extracts the gradient along the x axis, while the $G_y$ operator extracts the gradient along the y axis.

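A minimal PyTorch sketch of such a module is given below. The two Sobel kernels are registered as constant buffers, so they are never updated during training, while the convolution itself remains differentiable and lets a loss defined on the gradient maps back-propagate to the predicted depth. Shapes and names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DGE(nn.Module):
    """Depth gradient extracting module: two fixed Sobel convolutions."""
    def __init__(self):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.],
                           [-2., 0., 2.],
                           [-1., 0., 1.]])
        gy = gx.t()
        # Buffers hold untrainable constant parameters, excluded from the optimizer.
        self.register_buffer("gx", gx.view(1, 1, 3, 3))
        self.register_buffer("gy", gy.view(1, 1, 3, 3))

    def forward(self, depth):                   # depth: (N, 1, H, W)
        dx = F.conv2d(depth, self.gx)           # x-axis gradient map, (N, 1, H-2, W-2)
        dy = F.conv2d(depth, self.gy)           # y-axis gradient map
        return dx, dy

dge = DGE()
dx, dy = dge(torch.rand(1, 1, 160, 128))        # 160 x 128 depth -> 158 x 126 gradients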

The training of the DGE-CNN is performed in a branching structure. More specifically, in the case of the NYU Depth v2 dataset [30], the network propagates forward and produces two outputs from a single input image: the predicted depth map (160 × 128 pixels) and its depth gradient map (158 × 126 pixels). Accordingly, there are two labels corresponding to the forward inference of our network; the DGE module is used to generate the ground-truth depth gradient map from the ground-truth depth map. The loss function is defined on these two feature maps as

$$l(y,{y_d};{y^\ast },y_d^\ast ) = \frac{1}{{{N_t}}}\sum\nolimits_{y \in T} {[{{{||{y - {y^\ast }} ||}_2} + {\lambda_1}{{||{{y_{dx}} - {y_{dx}}^\ast } ||}_1} + {\lambda_2}{{||{{y_{dy}} - {y_{dy}}^\ast } ||}_1}} ]}$$
where $||\cdot||_1$ and $||\cdot||_2$ denote the L1 and L2 losses between the predicted and ground-truth images, respectively. ${y^\ast}$, $y_{dx}^\ast$, and $y_{dy}^\ast$ represent the ground-truth depth map and its corresponding x-axis and y-axis depth gradient maps, respectively ($\lambda_1$ and $\lambda_2$ are set to 0.2 during training). T is the collection of all images in each batch and ${N_t}$ is the batch size. The network is implemented in PyTorch 1.9.0 on an NVIDIA Quadro GV100 GPU platform. The proposed model is initialized with pre-trained ResNet-50 parameters. The batch size is set to 16, and the learning rate starts at 0.001 and is decreased by a factor of 0.8 every five epochs.
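A sketch of this loss and training setup is shown below, reusing the DepthNet and DGE sketches above. The choice of the Adam optimizer is an assumption; the paper specifies only the batch size, the initial learning rate, and the decay schedule.

import torch

def dge_loss(pred_depth, gt_depth, dge, lam1=0.2, lam2=0.2):
    """Eq. (1): per-image L2 on the depth map plus L1 terms on the DGE gradient maps."""
    pred_dx, pred_dy = dge(pred_depth)
    gt_dx, gt_dy = dge(gt_depth)                     # ground-truth gradients via DGE
    l2 = (pred_depth - gt_depth).flatten(1).norm(p=2, dim=1).mean()
    l1x = (pred_dx - gt_dx).flatten(1).abs().sum(dim=1).mean()
    l1y = (pred_dy - gt_dy).flatten(1).abs().sum(dim=1).mean()
    return l2 + lam1 * l1x + lam2 * l1y

model, dge = DepthNet(), DGE()                       # sketches defined above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # optimizer choice assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)
# batch size 16; the scheduler reduces the learning rate by a factor of 0.8 every 5 epochs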

To visualize the effect of the depth gradient extracting module, comparison experiments are conducted with and without the DGE as training guidance, using the classic L1 and L2 losses as baselines. As shown in Fig. 3, with the DGE module the boundaries in the predicted depth map are efficiently sharpened and clearly recovered to a satisfactory level, which leads to a better display effect for human visual characteristics. Low-value pixels in the depth gradient maps are set to zero so that the advantage of the DGE module in recovering high-spatial-frequency information can be seen. In addition, we apply the metrics widely employed for overall performance evaluation of the MDE task, defined as follows:

$$\textrm{RMS evaluation:}\,{S_1} = \sqrt {\frac{1}{{MN}}\sum\nolimits_{m,n} {{{|{y(m,n) - {y^\ast }(m,n)} |}^2}} }$$
$$\textrm{ABS evaluation:}\,{S_2} = \sqrt {\frac{1}{{MN}}\sum\nolimits_{m,n} {\left|{\frac{{y(m,n) - {y^\ast }(m,n)}}{{{y^\ast }(m,n)}}} \right|} }$$
$$\delta\ \textrm{Accuracy:}\ {S_3} = \%\ \textrm{of}\ {y_i}\ \textrm{s.t.}\ \max \left( {\frac{y}{{{y^\ast }}},\frac{{{y^\ast }}}{y}} \right) = \delta < {1.25^{th}}$$
where m and n are pixel indices, and M and N are the numbers of pixels along the two dimensions of the depth maps.
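These metrics can be computed directly from the predicted and ground-truth depth maps; a NumPy sketch follows. Eq. (3) is implemented as printed, with the square root taken over the mean absolute relative error.

import numpy as np

def rms(pred, gt):
    """Eq. (2): root-mean-square error."""
    return np.sqrt(np.mean((pred - gt) ** 2))

def abs_rel(pred, gt):
    """Eq. (3) as printed: square root of the mean absolute relative error."""
    return np.sqrt(np.mean(np.abs((pred - gt) / gt)))

def delta_accuracy(pred, gt, threshold=1.25):
    """Eq. (4): fraction of pixels whose ratio max(y/y*, y*/y) falls below the threshold."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < threshold)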


Fig. 3. (a) Input RGB image. (b) Ground truth of the depth map. (c)(f) Predicted depth map and gradient map with L1 loss. (d)(g) Predicted depth map and gradient map with L2 loss. (e)(h) Predicted depth map and gradient map with L2 loss and DGE.


We also compared the proposed network with the classical MDE method FCRN [23], which also uses ResNet-50 as the backbone encoder and thus has a similar number of network parameters to our DGE-CNN. In addition to the evaluation metrics listed in Table 1, we calculated the root mean square (RMS) error and accuracy of the gradient maps of both networks' predictions, denoted RMS-gd and $\mathrm{\delta}\textrm{-gd}$, respectively.


Table 1. Performance for different loss functions

As shown in Table 1, the root mean square (RMS) and absolute relative (ABS) errors for L2-DGE are 0.533 and 0.142, respectively, considerably lower than those for both L1 and L2. The $\mathrm{\delta}$ accuracy of L2-DGE also slightly exceeds that of L1 and L2. Additionally, Table 2 shows that the RMS-gd and $\mathrm{\delta}\textrm{-gd}$ of DGE-CNN are 1.096 and 0.583, respectively, demonstrating a substantial improvement in retrieving boundary information compared with FCRN. As expected, the DGE module shows great ability in sharpening the boundaries of the recovered depth map while also improving the overall estimation accuracy. Our DGE-CNN network can predict HVS-oriented depth maps with high accuracy and enhanced boundaries, and generalizes to most scenes.


Table 2. Performance comparison with FCRN

3. Holographic display

Given the depth map of the 3D scene from the network, we further calculate a hologram using the layer-based angular-spectrum algorithm [31]. The obtained 3D model is first sliced into 50 layers for hologram computation. A straightforward slicing-by-depth approach may weaken detailed depth information in the near scene when the pixels of the original RGB picture are distributed unevenly along the depth axis. We therefore propose a pixel-average slicing method that ensures each layer contains the same number of pixels. In this way, the effective intensity of the light field is distributed more evenly, which guarantees that the near scene receives more attention during display. With an indoor RGB image taken by a Huawei Mate30 Pro mobile phone (3840 × 2592) as the input, Fig. 4 shows the slicing process in detail with five layers as an example. In this case, with the traditional depth-average slicing method, the near tripod is displayed with a low depth resolution of only two layers, while the far scene, which viewers commonly do not notice, is given an excessive three depth layers. In contrast, with our pixel-average slicing method, the near tripod is shown in more detail with a high depth resolution of about four layers, and the depth resolution of the far scene is properly reduced.


Fig. 4. Pixel-average slicing method and depth-average slicing method.

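A minimal NumPy sketch of the two slicing strategies follows. Depth-average slicing splits the depth range into equal intervals, whereas pixel-average slicing places the layer boundaries at depth quantiles so that every layer holds roughly the same number of pixels; the function names are illustrative.

import numpy as np

def depth_average_slices(depth, n_layers=50):
    """Conventional slicing: equal depth intervals between the minimum and maximum depth."""
    edges = np.linspace(depth.min(), depth.max(), n_layers + 1)
    idx = np.searchsorted(edges, depth, side="right") - 1
    return np.clip(idx, 0, n_layers - 1)            # per-pixel layer index

def pixel_average_slices(depth, n_layers=50):
    """Proposed slicing: quantile boundaries give each layer the same pixel count."""
    edges = np.quantile(depth, np.linspace(0.0, 1.0, n_layers + 1))
    idx = np.searchsorted(edges, depth, side="right") - 1
    return np.clip(idx, 0, n_layers - 1)

Applying pixel_average_slices to the predicted depth map yields the per-pixel layer index from which the 50 amplitude layers are assembled for hologram computation.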

For each layer of the grayscale image, a random phase r(x, y) is superposed to simulate the diffusive effect of the object surface. Under angular-spectrum diffraction theory, both non-paraxial and paraxial fields can be simulated with high accuracy. The complex amplitude distribution on the hologram plane ${E_h}({x,y} )$ is calculated as follows:

$${E_h}(x,y) = \sum {\left( {F{T^{ - 1}}\left\{ {FT\{{{U_i}(x,y)\exp [{ir(x,y)} ]} \}\exp \left[ {\frac{{j2\pi {z_i}\sqrt {1 - {{(\lambda u)}^2} - {{(\lambda v)}^2}} }}{\lambda }} \right]} \right\}} \right)}$$
$$u = \frac{{\cos \alpha }}{\lambda },\textrm{ }v = \frac{{\cos \beta }}{\lambda }$$
where FT and FT$^{-1}$ represent the Fourier transform and its inverse, ${U_i}({x,y} )$ is the amplitude of the i-th layer, ${z_i}$ is the distance between the i-th layer and the hologram plane, $\lambda $ is the wavelength, u and v are the spatial frequencies, and $\alpha $ and $\beta $ are the angles between the incident wave and the x and y axes, respectively. Since a phase-only spatial light modulator (SLM) is used, the phase distribution $\varphi ({x,y} )$ is then extracted from the complex amplitude ${E_h}({x,y} )$. By adjusting the value of ${z_i}$ for each layer, the original scene can be freely stretched within a controllable 3D display range. The whole CGH generation process is depicted in Fig. 5(a).
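A NumPy sketch of Eqs. (5) and (6) is given below: each layer is multiplied by a random phase, propagated to the hologram plane by the angular-spectrum transfer function over its distance z_i, the propagated fields are summed, and the phase of the total field is kept for the phase-only SLM. The evanescent-wave clipping and the fixed random seed are implementation choices, not taken from the paper; the wavelength and pixel pitch default to the experimental values quoted below.

import numpy as np

def angular_spectrum_cgh(layers, z_list, wavelength=532e-9, pitch=3.74e-6):
    """layers: list of 2D amplitude images (one per depth slice); z_list: distances in meters."""
    ny, nx = layers[0].shape
    u = np.fft.fftfreq(nx, d=pitch)                 # spatial frequencies along x
    v = np.fft.fftfreq(ny, d=pitch)                 # spatial frequencies along y
    uu, vv = np.meshgrid(u, v)
    # Transfer-function phase 2*pi*z*sqrt(1/lambda^2 - u^2 - v^2); evanescent terms clipped.
    kz = 2 * np.pi * np.sqrt(np.maximum(0.0, 1.0 / wavelength**2 - uu**2 - vv**2))

    rng = np.random.default_rng(0)
    field = np.zeros((ny, nx), dtype=complex)
    for amp, z in zip(layers, z_list):
        rand_phase = np.exp(1j * 2 * np.pi * rng.random((ny, nx)))   # diffusive surface
        spectrum = np.fft.fft2(amp * rand_phase)
        field += np.fft.ifft2(spectrum * np.exp(1j * kz * z))
    return np.angle(field)                          # phase-only hologram for the SLM

# Example: 50 layers spread over the 0.30 m - 0.33 m reconstruction range
# z_list = np.linspace(0.30, 0.33, 50)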


Fig. 5. (a) Illustration of the 2D-to-3D holographic display system that generates the phase-only holograms starting from a single 2D RGB image. (b) Photograph of the optical setup. (c) Schematic diagram of the holographic 3D display system. (d) Reconstructions of a 3D scene at distances ranging from 0.3 m to 0.33 m.


The optical setup is shown in Fig. 5(b). A coherent beam was attenuated, expanded, and polarized before illuminating the SLM. A Holoeye GAEA-2 phase-only SLM with 3840 × 2160 pixels and a pixel pitch of 3.74 $\mathrm{\mu m}$ was employed, and the phase-only hologram of the target scene was uploaded to it. The illumination wavelength was 532 nm. A Canon 60D digital camera was used to capture the reconstructed 3D scenes. The schematic diagram of the holographic 3D display system is shown in Fig. 5(c). The 3D reconstructions were captured at distances ranging from 0.3 m to 0.33 m, over which the focus area gradually shifts from the near to the far sections, as shown in Fig. 5(d). In this case the optical reconstruction possesses a rather large depth of field, covering a 3 cm range along the optical path.

To further demonstrate the generalization capability of our 2D-to-3D holographic display system, we applied it to outdoor scenes. With the ResNet-based MDE depth predictor trained on the Make3D dataset [15], the desired 3D reconstruction is realized for outdoor scenes taken by a Huawei Mate30 Pro mobile phone. Reconstructions of 3D scenes from the NYU Depth V2 dataset and of indoor and outdoor scenes are shown in Fig. 6. In this case we narrow the display depth of field and set the distance range from 0.3 m to 0.31 m to visually show the display effect of our 2D-to-3D display system. The zoomed pictures of the 3D reconstructions show that the near and far objects are focused at the correct distances.


Fig. 6. The 3D reconstruction of different scenes captured at the front and rear focus planes. (a) Image from NYU V2 dataset and its reconstruction. (b)(c) Indoor scenes taken by mobile phone and their reconstruction. (d) Outdoor scene captured by mobile phone and its reconstruction.


4. Discussions

Previous 2D-to-3D holographic display work that follows a similar procedure takes a different approach to depth map prediction [8]. It classifies natural images into three categories, distant-view images, perspective-view images, and close-up images, and generates the corresponding depth maps by color-space transformation, vanishing-line detection, and occlusion prediction, respectively. This method successfully detects available depth cues but fails when the given scene becomes complicated, such as the tripod scene displayed in Fig. 5(d). That picture contains elements of all three categories, which confuses the classification and leads to unsuccessful depth recovery and poor holographic reconstruction. In contrast, our proposed DGE-CNN takes a comprehensive view of both indoor and outdoor images without any preprocessing such as classification, and predicts the corresponding boundary-enhanced depth maps with high accuracy and a better match with the HVS.

In our system, the average processing times for depth map generation, pixel-average slicing, and CGH generation are 174.97 ms, 73.42 ms, and 1201.67 ms, respectively; layer-based CGH generation contributes most to the total time consumption. Recently, various deep learning networks have been developed for real-time computational holography [32–34], which provide a powerful optimization by replacing simulated optical diffraction with a series of convolutional blocks. An end-to-end network has also been proposed for real-time 3D CGH generation [13]. These networks share a common limitation: the optical distance range of their 3D reconstruction is fixed by the particular training dataset. In our work we retain angular-spectrum propagation as the CGH generation method, which guarantees the flexibility of our 2D-to-3D display system. Meanwhile, the proposed pixel-average slicing technique overcomes the limitation of the layer-based angular-spectrum method that arises when the pixels of input images are distributed unevenly along the depth axis. As demonstrated in Fig. 5(d) and Fig. 6, by changing the propagation distance of each layer, we can stretch the displayed 3D volume from 10 mm to 30 mm in length and rebuild the display at an acceptable time cost, boosting its potential in all kinds of interactive display. To reach real-time speed and flexibility simultaneously, further research could be conducted toward an integrated end-to-end dynamic 3D CGH generation network with multiple adjustable parameters as input.
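As a simple illustration of this flexibility, the layer distances can be remapped to any target range before re-running the angular-spectrum propagation; the linear mapping below is an assumption about how the stretch is parameterized.

import numpy as np

def layer_distances(n_layers, z_near, z_far):
    """Linearly map layer index to propagation distance z_i (in meters)."""
    return np.linspace(z_near, z_far, n_layers)

# Recompute the CGH for a 10 mm or a 30 mm display volume without any retraining:
z_narrow = layer_distances(50, 0.30, 0.31)          # 10 mm volume
z_wide = layer_distances(50, 0.30, 0.33)            # 30 mm volume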

5. Conclusions

In this paper, we have presented a 2D-to-3D holographic display system that generates CGHs and presents 3D reconstructions directly from a single RGB 2D image using the DGE-CNN network. To acquire depth information from the monocular input image, we trained a ResNet-based convolutional neural network to generate depth maps with high depth resolution, which works well for both indoor and outdoor scenes. Meanwhile, a DGE module is designed and applied to the network to reinforce boundary information for a better fit to the human visual system. With pixel-average slicing and the layer-based angular-spectrum algorithm, we can generate CGHs for 3D holographic display and achieve the desired reconstruction over a controllable range of distances, relying on only one picture captured by a mobile phone camera. This work provides an efficient and effective way to enlarge holographic 3D content. No bulky device is involved in the whole process of generating 3D holograms from a single 2D RGB image, boosting the potential of multi-scene 3D holography and its everyday applications in interactive display.

Funding

National Natural Science Foundation of China (62035003); Spark Project at Tsinghua University.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [35].

References

1. J. Geng, “Three-dimensional display technologies,” Adv. Opt. Photonics 5(4), 456–535 (2013). [CrossRef]  

2. J. W. Goodman, Introduction to Fourier Optics (W. H. Freeman, 2017).

3. J. Hong, Y. Kim, H. J. Choi, J. Hahn, J. H. Park, H. Kim, S. W. Min, N. Chen, and B. Lee, “Three-dimensional display technologies of recent interest: principles, status, and issues [Invited],” Appl. Opt. 50(34), H87–115 (2011). [CrossRef]  

4. P. Blanche, “Holography, and the future of 3D display,” Light: Advanced Manufacturing. 2(4), 1 (2021). [CrossRef]  

5. J. Keil, D. Edler, and F. Dickmann, “Preparing the HoloLens for user Studies: an augmented reality interface for the spatial adjustment of holographic objects in 3D indoor environments,” KN J. Cartogr. Geogr. Inf. 69(3), 205–215 (2019). [CrossRef]  

6. J. Keil, A. Korte, A. Ratmer, D. Edler, and F. Dickmann, “Augmented reality (AR) and spatial cognition: effects of holographic grids on distance estimation and location memory in a 3D indoor scenario,” J. Photogramm. Remote Sens. Geoinf. Sci. 88(2), 165–172 (2020). [CrossRef]  

7. C. Moro, C. Phelps, P. Redmond, and Z. Stromberga, “HoloLens and mobile augmented reality in medical and health science education: A randomised controlled trial,” Br. J. Educ. Technol. 52(2), 680–694 (2021). [CrossRef]  

8. Z. He, X. Sui, and L. Cao, “Holographic 3D display using depth maps generated by 2D-to-3D rendering approach,” Appl. Sci. 11(21), 9889 (2021). [CrossRef]  

9. A. G. Marrugo, F. Gao, and S. Zhang, “State-of-the-art active optical techniques for three-dimensional surface metrology: a review [Invited],” J. Opt. Soc. Am. A 37(9), B60–B77 (2020). [CrossRef]  

10. Z. Khan, J.-C. Shih, R.-L. Chao, T.-L. Tsai, H.-C. Wang, G.-W. Fan, Y.-C. Lin, and J.-W. Shi, “High-brightness and high-speed vertical-cavity surface-emitting laser arrays,” Optica 7(4), 267–275 (2020). [CrossRef]  

11. M. Yamaguchi, “Light-field and holographic three-dimensional display,” J. Opt. Soc. Am. A 33(12), 2348–2364 (2016). [CrossRef]  

12. F. Zhong, R. Kumar, and C. Quan, “A cost-effective single-shot structured light system for 3D shape measurement,” IEEE Sensors J. 19(17), 7335–7346 (2019). [CrossRef]  

13. C. Chang, B. Dai, D. Zhu, J. Li, J. Xia, D. Zhang, L. Hou, and S. Zhuang, “From picture to 3D hologram: end-to-end learning of real-time 3D photorealistic hologram generation from 2D image input,” Opt. Lett. 48(4), 851–854 (2023). [CrossRef]  

14. A. Torralba and A. Oliva, “Depth estimation from image structure,” IEEE Trans. Pattern Anal. Machine Intell. 24(9), 1226–1238 (2002). [CrossRef]  

15. A. Saxena, M. Sun, and A. Y. Ng, “Make3D: Learning 3D scene structure from a single still image,” IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 824–840 (2009). [CrossRef]  

16. T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6612–6619 (2017).

17. N. Kong and M. J. Black, “Intrinsic depth: improving depth transfer with intrinsic images,” in 2015 IEEE International Conference on Computer Vision (ICCV), 3514–3522 (2015).

18. A. Saxena, S. H. Chung, and A. Y. Ng, “Learning depth from single monocular images,” in Proceedings of the 18th International Conference on Neural Information Processing Systems (NIPS), (MIT Press, Vancouver, British Columbia, Canada, 2005), pp. 1161–1168.

19. J. I. Jung and Y. S. Ho, “Depth map estimation from single-view image using object classification based on Bayesian learning,” in 2010 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video, 1–4 (2010).

20. M. Liu, M. Salzmann, and X. He, “Discrete-continuous depth estimation from a single image,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 716–723 (2014).

21. X. Dong, M. A. Garratt, S. G. Anavatti, and H. A. Abbass, “Towards real-time monocular depth estimation for robotics: a survey,” IEEE Trans. Intell. Transport. Syst. 1–10 (2022).

22. D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS) - Volume 2, (MIT Press, Montreal, Canada, 2014), pp. 2366–2374.

23. I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth International Conference on 3D Vision (3DV), 239–248 (2016).

24. Y. Cao, Z. Wu, and C. Shen, “Estimating depth from monocular images as classification using deep fully convolutional residual networks,” IEEE Trans. Circuits Syst. Video Technol. 28(11), 3174–3182 (2018). [CrossRef]  

25. D. Wofk, F. Ma, T. J. Yang, S. Karaman, and V. Sze, “FastDepth: fast monocular depth estimation on embedded systems,” in 2019 International Conference on Robotics and Automation (ICRA), 6101–6108 (2019).

26. S. F. Bhat, I. Alhashim, and P. Wonka, “AdaBins: depth estimation using adaptive bins,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4008–4017 (2021).

27. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778 (2016).

28. W. Zhou, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]  

29. V. V. Bezzubik and N. R. Belashenkov, “Modeling the contrast-sensitivity function of the human visual system,” J. Opt. Technol. 82(10), 711–717 (2015). [CrossRef]  

30. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in Proceedings of the 12th European conference on Computer Vision (ECCV) - Volume Part V, (Springer-Verlag, 2012), pp. 746–760.

31. Y. Zhao, L. Cao, H. Zhang, D. Kong, and G. Jin, “Accurate calculation of computer-generated holograms using angular-spectrum layer-oriented method,” Opt. Express 23(20), 25440–25449 (2015). [CrossRef]  

32. L. Shi, B. Li, C. Kim, P. Kellnhofer, and W. Matusik, “Towards real-time photorealistic 3D holography with deep neural networks,” Nature 591(7849), 234–239 (2021). [CrossRef]  

33. L. Shi, B. Li, and W. Matusik, “End-to-end learning of 3D phase-only holograms for holographic display,” Light: Sci. Appl. 11(1), 247 (2022). [CrossRef]  

34. K. Liu, J. Wu, Z. He, and L. Cao, “4K-DMDNet: diffraction model-driven network for 4 K computer-generated holography,” Opto-Electron. Adv. 6(5), 220135 (2023). [CrossRef]  

35. N. Liu, Z. Huang, Z. He, and L. Cao, "DGE-CNN: 2D-to-3D holographic display based on depth gradient extracting module and CNN network," GitHub, 2023, https://github.com/lnhtsinghua/DGE-CNN-demo.

