
Deep learning-based end-to-end 3D depth recovery from a single-frame fringe pattern with the MSUNet++ network

Open Access

Abstract

Deep learning (DL)-based methods that reconstruct 3D depth from a single-frame fringe pattern have attracted extensive research interest. The goal is to estimate a high-precision 3D shape from a single fringe pattern with limited information. This work therefore proposes an end-to-end DL-based 3D reconstruction method from a single fringe pattern that achieves high-accuracy depth recovery while preserving the geometric details of the tested objects. We construct a multi-scale feature fusion convolutional neural network (CNN), called MSUNet++, and incorporate the discrete wavelet transform (DWT) into data preprocessing to extract the high-frequency signals of the fringe patterns as network input. In addition, we establish a loss function that combines structural similarity with edge perception. With these measures, the high-frequency geometric details of the reconstruction results are clearly enhanced, while the overall geometric shape is effectively maintained. Ablation experiments validate the effectiveness of the proposed solution. The 3D reconstruction results and generalization experiments on different test samples show that the proposed method offers higher accuracy, better detail preservation, and greater robustness than the compared methods.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Digital fringe projection profilometry (FPP) reconstructs the 3D point cloud of an object by projecting coded fringe patterns onto the object's surface and capturing the deformed fringe images modulated by the object; the 3D point cloud is then generated through the steps of fringe analysis, phase analysis, phase-height mapping, and camera calibration. As a low-cost, full-field, high-speed, and high-precision non-contact 3D measurement method, FPP has been widely used in fields such as geometric measurement, cultural relic restoration, and reverse engineering [1–7].

Traditional FPP techniques mainly comprise two methods: phase measurement profilometry (PMP) in the time domain and Fourier transform profilometry (FTP) in the spatial domain. Each has its strengths and limitations. PMP [8] requires the projection of at least three phase-shifted fringe patterns; it provides higher spatial resolution and phase measurement accuracy without requiring the measured object to be spatially isolated, and it is more robust against environmental light and variations in surface reflectivity. FTP [9], in contrast, is based on spatial filtering and requires only a single fringe pattern to obtain surface information of the measured object, so it is well suited to real-time data acquisition and the 3D measurement of simple, continuous geometries in dynamic processes, but it may suffer from spectrum aliasing and loss of details.

With the rapid development of deep learning (DL), it has been widely applied in many fields, such as optical imaging and computational imaging [10–13], and FPP is no longer limited to traditional methods either. Nowadays, DL has been applied to all stages of FPP-based 3D reconstruction. Shi et al. [14] proposed a DL-based method for enhancing fringe information, which improved reconstruction accuracy. Zuo et al. [15] used DL to simulate the phase demodulation process of traditional phase-shifting methods and obtained more accurate phase distributions than the traditional single-frame phase demodulation method. Nguyen et al. [16] introduced a single-input dual-output method that predicts multiple phase-shifted fringe patterns and phase orders from a single fringe pattern, enabling dynamic reconstruction. However, these methods mostly focus on a specific step within the traditional FPP pipeline.

End-to-end DL methods that map a single fringe pattern directly to a complete 3D surface have also been developed recently. Van der Jeught et al. [17] demonstrated the feasibility of end-to-end FPP with DL by generating simulated fringe-height data pairs and validating them using a multi-layer CNN. Cheng et al. [18] simulated a modulation-based structured-light height measurement method with DL, showing higher accuracy and efficiency than the traditional FTP method. Wang et al. [19] generated large datasets of fringe patterns and surface heights with the help of Blender software and validated the feasibility of end-to-end reconstruction using CNN and GAN networks. Recently, Wu et al. [20] incorporated the traditional discrete wavelet transform (DWT) into end-to-end DL FPP reconstruction and also achieved excellent results. While these methods have achieved satisfactory results, most of them focus on network architecture design and on adding parameters to improve reconstruction accuracy. However, data preprocessing and loss functions are equally important, and a better feature extraction method is likewise necessary, since all of these have a vital impact on the performance of the trained network.

This paper introduces a comprehensive approach to improving the accuracy of end-to-end DL fringe-to-3D depth estimation. First, we develop MSUNet++, a multi-scale CNN that considers high-level and low-level features simultaneously, and design a multi-scale feature extraction and fusion module that enlarges the receptive field, improves the network's perception, and learns depth information effectively. Second, we introduce the DWT as a preprocessing step on the network input to amplify the importance of the high-frequency signals that are crucial for 3D depth reconstruction; this enables more effective extraction of detailed image information and ultimately boosts reconstruction accuracy. Third, we introduce a loss function that merges structural similarity and edge perception, adapting it to the task and enhancing reconstruction quality. By strategically combining these strategies, we anticipate significant improvements in the accuracy of end-to-end DL fringe-to-3D depth estimation. We validate the feasibility and effectiveness of our approach through ablation and generalization experiments.

2. Methodology

2.1 Fringe projection profilometry (FPP)

FPP [21] technology is widely recognized as one of the most reliable and accurate methods for 3D reconstruction. This technique involves projecting a series of sinusoidal fringe patterns onto the surface of an object and utilizing the phase information from the deformed fringe patterns captured by a camera, which is modulated by the object’s height. By combining this phase information with system calibration parameters, the technology accomplishes the task of 3D reconstruction. Taking the N-step phase-shifting method as an example, the sinusoidal encoded fringe patterns captured by the camera can be represented as follows:

$$I_i(x,y)=A(x,y)+B(x,y)\cos\left(\varphi(x,y)-\frac{2\pi i}{N}\right)$$
where $(x,y)$ represents the pixel coordinates, $I_i(x,y)$ denotes the captured fringe pattern intensity, $A(x,y)$ is the average intensity, $B(x,y)$ represents the modulation, $\varphi (x,y)$ is the desired phase information, and $i$ indicates the phase-shifting step $(i=1,2,\ldots,N)$. And the wrapped phase $\varphi (x,y)$ can be described as follows by utilizing the aforementioned multi-step phase-shifting information:
$$\varphi(x,y)=\arctan\frac{\sum_{i=1}^N I_i(x,y)\sin\frac{2\pi i}{N}}{\sum_{i=1}^N I_i(x,y)\cos\frac{2\pi i}{N}}$$
where the phase $\varphi (x,y)$ is wrapped within the range of $[-\pi, \pi ]$. Therefore, an algorithm such as the multi-frequency temporal phase unwrapping (TPU) algorithm [22] is required to unwrap $\varphi (x,y)$ into the continuous phase $\phi (x,y)$ that accurately reflects the object height.
$$\phi(x,y)=\varphi(x,y)+2\pi k(x,y)$$
where $k(x,y)$ denotes the fringe order determined by the TPU algorithm. Finally, the continuous phase $\phi (x,y)$ can be converted into height (depth) using a phase-height mapping algorithm [23]. The height $Z(x,y)$ can be expressed as a polynomial of the phase:
$$Z(x,y)=c_0+c_1\phi(x,y)+c_2\phi(x,y)^2+\cdots+c_n\phi(x,y)^n$$
where $c_0,c_1,c_2,\ldots,c_n$ are polynomial coefficients for calibration, which are unique for each pixel $(x,y)$, and $n$ represents the polynomial degree.
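For concreteness, the following minimal NumPy sketch illustrates the wrapped-phase formula and the polynomial phase-height mapping above. The array shapes and the use of arctan2 for quadrant-aware wrapping are implementation assumptions, and the TPU unwrapping step [22] between the two functions is omitted.

```python
import numpy as np

def wrapped_phase(patterns):
    """Wrapped phase from N phase-shifted fringe patterns I_i(x, y).

    patterns: array of shape (N, H, W) acquired with phase shifts 2*pi*i/N, i = 1..N.
    np.arctan2 makes the result quadrant-aware and wrapped to (-pi, pi].
    """
    N = patterns.shape[0]
    i = np.arange(1, N + 1).reshape(-1, 1, 1)
    num = np.sum(patterns * np.sin(2 * np.pi * i / N), axis=0)
    den = np.sum(patterns * np.cos(2 * np.pi * i / N), axis=0)
    return np.arctan2(num, den)

def phase_to_height(unwrapped_phi, coeffs):
    """Per-pixel polynomial phase-height mapping; coeffs has shape (n + 1, H, W)."""
    return sum(c * unwrapped_phi ** k for k, c in enumerate(coeffs))
```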

2.2 MSUNet++

Figure 1 illustrates the proposed MSUNet++ network incorporating multi-scale feature fusion. Built upon UNet++ [24], which was originally developed for medical image segmentation, MSUNet++ addresses the limitation of UNet that deep-level and low-level features are not balanced, allowing the network to learn the weights of features at different depths on its own. MSUNet++ shares the same feature extractor among its decoders, i.e., a shared encoder, as in Fig. 1(b): rather than training UNets of different depths individually, only one encoder is trained, and features at different levels are restored by different decoders, as shown in Fig. 1(c). In other words, when the four UNet structures of increasing depth are superimposed, their encoder parts overlap; we therefore merge the overlapping portions of the four encoders into a shared encoder, and each decoder is connected to the corresponding node of the shared encoder according to its depth. To ensure that gradients propagate effectively to the intermediate layers of each sub-decoder during training, we adopt the comprehensive long-short skip connections proposed by the original authors of UNet++ to connect the corresponding nodes of the different decoders. Moreover, intermediate features from the shallow UNet levels are propagated to the decoders of the deeper UNet levels, enhancing the expressive capability of the deep UNet's low-level features. Differing from the original UNet++, we adapt MSUNet++ to our 3D task by integrating the output features from the various UNet levels through channel-wise fusion and employing a dual-layer convolutional output head for prediction; this design enables effective integration of feature information across levels. Additionally, we introduce a multi-scale feature fusion block (MSFFB) to enhance the feature extraction and fusion capabilities of MSUNet++. As a further key improvement, our approach incorporates the DWT in the input feature processing. Clearly different from Wu et al.'s approach [20], we use only the high-frequency components of the DWT of the fringe pattern while retaining the original input: to match the input size, the high-frequency components are upsampled and concatenated with the original fringe pattern, as shown in Fig. 1(a). To the best of our knowledge, no previous work has adopted this strategy for end-to-end accurate 3D reconstruction.
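As an illustration of the output stage described above, the following PyTorch sketch fuses the final feature maps of the multi-level decoders channel-wise and regresses the depth map through a dual-layer convolutional head. The channel counts, kernel sizes, and activation are assumptions for the sketch, not the exact configuration of MSUNet++.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch: channel-wise fusion of the multi-level decoder outputs followed by
    a two-layer convolutional head that regresses a single-channel depth map."""
    def __init__(self, level_channels=(32, 32, 32, 32), mid_ch=32):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(sum(level_channels), mid_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, kernel_size=3, padding=1))  # depth prediction

    def forward(self, decoder_outputs):
        # decoder_outputs: list of same-resolution feature maps, one per UNet level
        return self.head(torch.cat(decoder_outputs, dim=1))
```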


Fig. 1. MSUNet++ architecture overview. (a) DWT components details, $CA$ is denoted as the low-frequency component, and $CH$, $CV$, and $CD$ are respectively denoted as the high-frequency components in horizontal, vertical, and diagonal directions; (b) Shared encoder; (c) Multi-level decoders, different colors represent different levels of decoders.


2.3 Multi-scale feature fusion block (MSFFB)

We construct a multi-scale feature fusion block, MSFFB (Fig. 2), to replace the simple convolutional feature extraction blocks of UNet++ in the encoder and decoder. Inspired by Inception [25], we extract features in parallel using dilated convolutions with different dilation rates and fuse the outputs of the different branches by concatenation. As shown in Fig. 2(a), MSFFB extracts features with different receptive fields in multiple branches while preserving the original input features, and merges all branch features through concatenation. Branch0 uses a convolutional layer with a kernel size of $1\times 1$ to preserve the original features, while Branch1 to Branch4 use convolutional layers with kernel sizes of $3\times 3$ and increasing dilation rates to extract more diverse features. The convolutional layer details of each branch are shown in Fig. 2(b). Dilated convolutions at different scales not only enlarge the receptive field but also improve the robustness of the network, enhancing its ability to perceive and express information at different scales in the input features and yielding more comprehensive and richer feature representations.
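A possible PyTorch realization of the block described above is sketched below: a $1\times 1$ branch preserves the input features, four $3\times 3$ dilated branches with increasing dilation rates enlarge the receptive field, and the branch outputs are concatenated. The specific dilation rates, the batch-normalization/ReLU pairing, and the $1\times 1$ fusion convolution are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class MSFFB(nn.Module):
    """Multi-scale feature fusion block (sketch based on the description in the text)."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3, 4)):
        super().__init__()
        # Branch0: 1x1 convolution preserving the original features
        self.branch0 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # Branch1..Branch4: 3x3 dilated convolutions with increasing dilation rates
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for d in dilations])
        # 1x1 convolution fusing the concatenated branch features back to out_ch channels
        self.fuse = nn.Conv2d(out_ch * (1 + len(dilations)), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [self.branch0(x)] + [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```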


Fig. 2. MSFFB. (a) MSFFB structure, C represents the number of feature channels; (b) Branch descriptions of MSFFB.


2.4 Discrete wavelet transform (DWT)

CNNs tend to focus primarily on the low-frequency components of images during training [26]. From a robustness perspective, however, the equally important high-frequency features must not be disregarded [27], and in the context of fringe-pattern 3D reconstruction these high-frequency features are particularly significant. Therefore, in this paper the DWT is introduced to extract the high-frequency information of the input fringe patterns, and the high-frequency components are fed into the network together with the original input to increase the weight of the high-frequency components. The DWT is known to have good properties in image processing: it decomposes an image into frequency bands at various scales and thus effectively represents high-frequency information such as details and textures. The DWT is represented as:

$$CA,CH,CV,CD = DWT(I)$$
where $CA$ is denoted as the low-frequency component of input $I$, and $CH$, $CV$ and $CD$ are denoted as the high-frequency components in horizontal, vertical, and diagonal directions, respectively, as shown in Fig. 1 (a). By inputting both the original fringe patterns and the high-frequency components into the network, we achieve two objectives. On the one hand, it provides the network with the original information, reducing the noise influence on the DWT features. On the other hand, it is able to offer richer information, allowing the network to learn the high-frequency signals better and recover the details and texture information of the fringe patterns more accurately, thereby improving the accuracy and quality of the reconstruction results.
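The preprocessing described above can be sketched with PyWavelets as follows: a single-level 2D DWT yields $CA$, $CH$, $CV$, and $CD$; the three high-frequency sub-bands are upsampled back to the input resolution and concatenated with the original fringe pattern as extra channels. The choice of the 'haar' wavelet and of nearest-neighbour upsampling are assumptions, since the paper does not specify them.

```python
import numpy as np
import pywt

def dwt_augmented_input(fringe):
    """Build the network input: original fringe pattern plus upsampled CH, CV, CD."""
    cA, (cH, cV, cD) = pywt.dwt2(fringe, 'haar')   # single-level 2D DWT
    H, W = fringe.shape

    def upsample(band):
        # nearest-neighbour 2x upsampling, cropped in case of odd image sizes
        up = np.repeat(np.repeat(band, 2, axis=0), 2, axis=1)
        return up[:H, :W]

    channels = [fringe] + [upsample(b) for b in (cH, cV, cD)]
    return np.stack(channels, axis=0)              # shape (4, H, W)
```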

2.5 Loss function

Loss functions play a crucial role in network training, and mean absolute error (L1 Loss) and mean squared error (L2 Loss) are commonly used loss functions for regression tasks. However, both of these loss functions mainly focus on evaluating the average error, which may lead to poorer results in certain local areas.

Image quality is an important concern in the fields of image processing and computer vision. The Structural Similarity Index Measure (SSIM) [28] is an evaluation metric widely used in image processing that effectively quantifies the structural similarity of images, providing an objective assessment of image quality. SSIM evaluates image quality based on structural information and measures the similarity between images, rather than just the average error. In this study, the objective is to recover the 3D depth information from a single frame fringe pattern, which can be reflected in the depth map. Therefore, SSIM is well-suited for this task, which is defined as:

$$SSIM(x,y)=\frac{(2 \mu_x \mu_y + c_1 )(2 \sigma_{xy} + c_2)}{(\mu_x^2+\mu_y^2+c_1)(\sigma_x^2+\sigma_y^2+c_2)},$$
where $\mu _x$ and $\mu _y$ are the means of the compared images $x$ and $y$, respectively, $\sigma _x^2$ and $\sigma _y^2$ are their variances, $\sigma _{xy}$ is the covariance between $x$ and $y$, and $c_1$ and $c_2$ are small constants that stabilize the division. The value of SSIM lies in the range of [0,1]: the higher the similarity between two images, the closer the SSIM score is to 1; conversely, the larger the difference between the images, the closer the SSIM score is to 0. Therefore, as a loss function, the SSIM loss is expressed as:
$$Loss1 = 1-SSIM(GT, P(I)),$$
where $GT$ represents the ground truth of $I$, and $P(I)$ denotes the prediction of the neural network.

On the other hand, local details are usually present in the edge regions of the image, so we introduce an edge perception loss to increase the focus on edge information. The edge perception loss is expressed as the L2 loss between the edge detection results of the ground truth and the prediction:

$$Loss2 = L2Loss(Edge(GT), Edge(P(I))).$$
where $Edge$ represents the edge detection function that utilizes a $3\times 3$ Laplacian kernel.

A more comprehensive and accurate definition of the loss function is achieved by weighting and summing different loss terms to combine multiple aspects of the objective. Each loss term plays a different role in the loss function, helping the network to optimise the objective and improve the performance of the model. The final expression of the loss is as follows:

$$Loss = \lambda_1Loss1 + \lambda_2Loss2.$$
where $\lambda _1$ and $\lambda _2$ represent the weight coefficients of $Loss1$ and $Loss2$, respectively. After extensive experimental validation, $\lambda _1:\lambda _2$ is set to $10:1$ in this paper.
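A minimal PyTorch sketch of the combined loss is given below, assuming single-channel depth maps normalized to [0,1], a third-party SSIM implementation (pytorch_msssim), and a 4-neighbour $3\times 3$ Laplacian kernel; the exact kernel variant and SSIM window settings are not specified in the paper and are assumptions here.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def edge_map(x):
    # 3x3 Laplacian edge detection on single-channel depth maps (B, 1, H, W)
    return F.conv2d(x, LAPLACIAN.to(x.device, x.dtype), padding=1)

def combined_loss(pred, gt, lam1=10.0, lam2=1.0):
    loss1 = 1.0 - ssim(pred, gt, data_range=1.0)        # SSIM loss (Loss1)
    loss2 = F.mse_loss(edge_map(pred), edge_map(gt))    # edge perception loss (Loss2)
    return lam1 * loss1 + lam2 * loss2                  # weighted sum, lambda1:lambda2 = 10:1
```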

3. Experiments and analysis

In this section, the effectiveness and robustness of the proposed method are validated through generalization and ablation experiments. The network is built on the PyTorch DL framework and deployed on an NVIDIA RTX 3090 GPU (24 GB). Training runs for 200 epochs with a batch size of 4 and an initial learning rate of 0.0001, which is adjusted dynamically during training. To facilitate comparison of the results, the normalized outputs are converted back to their original scale. The evaluation metrics include the root mean square error (RMSE), the Structural Similarity Index Measure (SSIM), and the mean relative error (MRE). MRE is a commonly used evaluation metric in depth map reconstruction that measures the error between predicted and GT depth maps. It is expressed as follows:

$$MRE=\frac{1}{Q}\sum\frac{|D-GT|}{GT}\times 100\%$$
where $D$ represents the predicted depth map, $GT$ represents the GT depth map, $Q$ is the total number of pixels, and $\sum$ denotes the sum over all pixels. A lower MRE indicates a closer match between the predicted and GT depth maps and thus higher accuracy in the depth map reconstruction. The raw MRE value falls within the range of $[0,1]$; to provide a more intuitive understanding of the error magnitude and to facilitate comparison and evaluation, it is expressed as a percentage in this paper.
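For reference, the three evaluation metrics can be computed per depth map as in the sketch below; scikit-image's structural_similarity is used for SSIM, and a small epsilon guards against division by zero in the MRE (a guard not present in the formula above, added here as an implementation assumption).

```python
import numpy as np
from skimage.metrics import structural_similarity

def evaluate(pred, gt, eps=1e-8):
    """RMSE (mm), SSIM, and MRE (%) between a predicted and a GT depth map."""
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ssim_val = structural_similarity(pred, gt, data_range=gt.max() - gt.min())
    mre = np.mean(np.abs(pred - gt) / (np.abs(gt) + eps)) * 100.0
    return rmse, ssim_val, mre
```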

3.1 Datasets

The dataset adopted in this paper is that of Nguyen's article [16] (the raw data can be obtained from [29]), which contains more than 1,500 pairs of fringe patterns and corresponding depth maps (GT). It includes various samples, with single objects or multiple objects placed in different poses, as shown in Fig. 3, and the data are generated using the traditional FPP method described in Section 2.1. The dataset is randomly divided into training, validation, and test sets in a ratio of 8:1:1, and all the data are normalized to the range [0,1] by the following equation:

$$I_{norm} = \frac{I - I_{min}}{I_{max}-I_{min}}$$
where $I_{norm}$ represents the normalized data, $I$ denotes the original data, and $I_{max}$ and $I_{min}$ respectively refer to the maximum and minimum values among all the original data.
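A short sketch of this normalization and its inverse (used to recover predictions at the original depth scale, as mentioned in Section 3) is given below; the global dataset extrema are assumed to be precomputed.

```python
def normalize(x, x_min, x_max):
    """Min-max normalization to [0, 1] using global dataset extrema."""
    return (x - x_min) / (x_max - x_min)

def denormalize(x_norm, x_min, x_max):
    """Inverse mapping: recover values at the original depth scale."""
    return x_norm * (x_max - x_min) + x_min
```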


Fig. 3. Datasets including (a) samples containing different single objects; (b) samples containing two identical objects in different poses; (c) two samples each containing three different objects. The corresponding depth GTs are shown on the right.


3.2 Generalization experiments

We compare the generalization ability of the proposed method with the classical UNet [30], hNet [31], UNet-Wavelet, and hNet-Wavelet [20] on a test set of 153 data pairs. Examining the reconstructed point clouds and reconstruction errors of the single-object (Fig. 4) and multi-object (Fig. 5) samples, it is clear that all methods can roughly reconstruct the surface contours of the tested objects. However, the proposed method demonstrates better reconstruction capability, with smoother results, fewer noise points, and a more uniform distribution of the overall errors. In contrast, the other methods exhibit more noise in the reconstructed surfaces, a rougher visual appearance, and, for certain network structures, noticeable errors in areas with significant boundary fluctuations and texture variations, such as the outer ring portion of the reconstruction error of the hNet-Wavelet network shown in Fig. 4(b). In terms of the quantitative metric RMSE, the proposed method shows reductions of 18.20% and 36.88% compared to the best-performing comparison method, UNet, for the two samples presented in Fig. 4 and Fig. 5, respectively. The MRE likewise shows reductions of 67.83% and 68.18%, indicating that the reconstruction results of the proposed method are closer to the GT. The SSIM values are consistently high, above 0.988, for both methods. Comparing the cross lines c1 and c2 marked in sub-figures (b), it is evident from sub-figures (c) of both figures that the proposed method produces smoother reconstructions than the other methods, which exhibit more jagged and rough surfaces. This further demonstrates the robustness of the proposed method.


Fig. 4. Reconstruction results of a single object. (a) Point cloud, where the units of RMSE and MRE are mm and 100%, respectively; (b) Reconstruction error relative to GT; (c) Crossed-lines of 3D contour reconstructed via different methods marked in the Fringe Pattern in (b) with c1 and c2.



Fig. 5. Reconstruction results of multiple objects. (a) Point cloud, and the units for RMSE and MRE are mm and 100% respectively; (b) Reconstruction error relative to GT; (c) Crossed-lines of 3D contour reconstructed via different methods marked in the Fringe Pattern in (b) with c1 and c2.


The quality of detail reconstruction is one of the key focuses of FPP techniques. Figure 6 shows the reconstruction results of different methods on the mouth area of an animal plaster model with significant variations. It is evident that our proposed method better preserves the original detailed features in that area, including textures, curves, etc., resulting in a more refined visual appearance compared to other methods. On the other hand, other methods exhibit shortcomings in detail reconstruction, failing to fully retain the original detailed features and displaying blurriness and distortion in the reconstruction results. Therefore, our proposed method demonstrates a significant advantage over other compared methods in terms of geometry shape details reconstruction.


Fig. 6. Comparison of details. The enlarged sub-images within red circles show the local 3D contours for geometry details demonstration.


In addition, we carry out a quantitative analysis of RMSE, SSIM, and MRE on the entire test set; the specific results are shown in Table 1. The statistics clearly indicate that our method outperforms the others in all metrics. Compared with the best-performing comparison method, UNet (RMSE: 1.47339 mm, MRE: 3.76670%), our approach (RMSE: 1.17604 mm, MRE: 1.16284%) achieves reductions of 20.18% in RMSE and 69.13% in MRE, indicating that our results have higher accuracy and lower deviation and that our method is therefore more reliable. Regarding the SSIM metric, all methods yield results exceeding 0.98, implying that they are all capable of approximating the overall shape of the tested objects, a conclusion also supported by Fig. 4 and Fig. 5. From the performance curves of all 153 data pairs in the test set shown in Fig. 7, the advantage of the proposed method is clearly observed.


Fig. 7. Comparison of all metrics on the test set. (a) RMSE; (b) SSIM; (c) MRE.



Table 1. Performance comparison of different methods on the entire test dataset including 153 pairs of data

Since MSUNet++ employs a multi-branch connection structure and the MSFFB module a multi-branch computation structure, our network does not hold a competitive edge in terms of time efficiency. Table 2 lists the average training time per epoch and the testing time per individual sample in comparison with the other network models. Compared with the top-performing comparison model, UNet, our training time is 3.745 times that of UNet, and the testing time is 8.041 times longer. However, given the quality of the reconstruction results, we consider this trade-off in time acceptable.


Table 2. Time Consumption Comparison. The training time per epoch and the testing time for a single sample

3.3 Ablation experiments

We implement the following ablation experiments to validate the effectiveness of our proposed method. Our network is decomposed into several sub-models (A, B, and C) as shown in Table 3, where models A, B, and C all adopt the L2 loss function.


Table 3. Configuration of ablation experiments

Figure 8 displays the reconstruction results of the different models for a particular sample in the test set. The reconstruction quality gradually improves as modules are added, and the 3D surface becomes progressively smoother. In particular, the labeled notch region evolves from the blurriness of Model A to a clear and crisp shape, which verifies the effectiveness of the proposed modules.


Fig. 8. Reconstruction results of different sub-models. (a) Point cloud, and the units for RMSE and MRE are mm and 100%, respectively; (b) Reconstruction error relative to GT; (c) Crossed-lines of 3D contour reconstructed via different methods marked in the Fringe Pattern in (b) with c1 and c2.


From the metric results for the whole test set (Table 4), the small variation in the SSIM metric indicates that all models successfully reconstruct the approximate 3D surface shape. However, when the MSFFB module is introduced in Model B, the RMSE and MRE drop markedly by 21.62% and 36.16% relative to Model A, demonstrating the excellent feature extraction capability of MSFFB. The inclusion of the DWT features also leads to a clear improvement in MRE: although the gain is smaller than that of MSFFB, Model C reduces the RMSE by 4.18% and the MRE by 20.32% compared with Model B. After incorporating the proposed loss function, the full MSUNet++ model further reduces the RMSE by 5.64% and the MRE by 59.28% compared with Model C. This shows that the predictions of MSUNet++ are closer to the GT and confirms the superior performance of the proposed loss function in this task.


Table 4. Comparison of different ablation models

Furthermore, with the same loss function, our Model C (RMSE: 1.24635 mm, MRE: 2.85584%) also outperforms the previously mentioned UNet (RMSE: 1.47339 mm, MRE: 3.76670%) in terms of metrics on the test set, which further confirms the effectiveness of the proposed network architecture.

4. Conclusions

In this paper, a DL-based 3D depth reconstruction method built on the DWT and multi-scale feature fusion is introduced with the aim of improving the reconstruction accuracy from single-frame fringe patterns to 3D surfaces. By fully utilizing the multi-level feature extraction and fusion capabilities of UNet++ and incorporating the multi-scale feature extraction and fusion module MSFFB, the feature extraction and fusion ability is effectively enhanced, resulting in higher reconstruction accuracy. Additionally, in the data preprocessing stage, we employ the DWT to extract the high-frequency signals of the fringe patterns, thereby strengthening the influence of high-frequency signals on the reconstruction results. Furthermore, the proposed structural similarity and edge perception loss function contributes significantly to the leap in reconstruction quality. Through comparative experiments with other methods, we have validated the excellent performance of the proposed model, which achieves higher reconstruction accuracy than the compared methods for single-frame fringe-to-3D surface reconstruction. Despite these strengths, our method still has some limitations. First, the network has a large number of parameters, 36.2590 million in total; this is only 5.03% more than the tested UNet model (34.5240 million), yet the reconstruction results show reductions of 20.18% in RMSE and 69.13% in MRE compared with UNet. Second, the multi-branch MSFFB module and UNet++ itself require storing multiple intermediate results, leading to high memory usage. Moreover, compared with other methods, the training of this multi-branch structure is relatively slow. On the whole, the method presented in this work provides a useful reference for improving DL-based FPP. In future research, we expect to further optimise the method, advance its application, and contribute to the development of related fields.

Funding

National Natural Science Foundation of China (62101364, 61901287); Sichuan Provincial Central Guidance Local Science and Technology Development Project (2022ZYD0111); Key Research and Development Program of Sichuan Province (2021YFG0195, 2022YFG0053); China Postdoctoral Science Foundation (2021M692260).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [29].

References

1. P. Zhang, K. Zhong, L. Zhongwei, et al., “High dynamic range 3D measurement based on structured light: a review,” J. Adv. Manuf. Sci. Technol. 1(2), 2021004 (2021). [CrossRef]  

2. S. Van der Jeught and J. J. Dirckx, “Real-time structured light profilometry: a review,” Opt. Lasers Eng. 87, 18–31 (2016). [CrossRef]  

3. S. Zhang, “High-speed 3d shape measurement with structured light methods: A review,” Opt. Lasers Eng. 106, 119–131 (2018). [CrossRef]  

4. Y. Liu, Y. Fu, Y. Zhuan, et al., “High dynamic range real-time 3D measurement based on fourier transform profilometry,” Opt. Laser Technol. 138, 106833 (2021). [CrossRef]  

5. H. Nguyen, J. Liang, Y. Wang, et al., “Accuracy assessment of fringe projection profilometry and digital image correlation techniques for three-dimensional shape measurements,” JPhys Photonics 3(1), 014004 (2021). [CrossRef]  

6. P. Zhou, X. Feng, J. Luo, et al., “Temporal-spatial binary encoding method based on dynamic threshold optimization for 3D shape measurement,” Opt. Express 31(14), 23274–23293 (2023). [CrossRef]  

7. P. Zhou, Y. Cheng, J. Zhu, et al., “High-dynamic-range 3-D shape measurement with adaptive speckle projection through segmentation-based mapping,” IEEE Trans. Instrum. Meas. 72, 1–12 (2023). [CrossRef]  

8. C. Zuo, S. Feng, L. Huang, et al., “Phase shifting algorithms for fringe projection profilometry: a review,” Opt. Lasers Eng. 109, 23–59 (2018). [CrossRef]  

9. X. Su and W. Chen, “Fourier transform profilometry: a review,” Opt. Lasers Eng. 35(5), 263–284 (2001). [CrossRef]  

10. C. Zuo, J. Qian, S. Feng, et al., “Deep learning in optical metrology: a review,” Light: Sci. Appl. 11(1), 39 (2022). [CrossRef]  

11. Y. Rivenson, Y. Zhang, H. Günaydın, et al., “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light: Sci. Appl. 7(2), 17141 (2017). [CrossRef]  

12. W. Yin, Y. Hu, S. Feng, et al., “Single-shot 3D shape measurement using an end-to-end stereo matching network for speckle projection profilometry,” Opt. Express 29(9), 13388–13407 (2021). [CrossRef]  

13. R. Wang, P. Zhou, and J. Zhu, “Accurate 3D reconstruction of single-frame speckle-encoded textureless surfaces based on densely connected stereo matching network,” Opt. Express 31(9), 14048–14067 (2023). [CrossRef]  

14. J. Shi, X. Zhu, H. Wang, et al., “Label enhanced and patch based deep learning for phase retrieval from single frame fringe pattern in fringe projection 3D measurement,” Opt. Express 27(20), 28929–28943 (2019). [CrossRef]  

15. S. Feng, Q. Chen, G. Gu, et al., “Fringe pattern analysis using deep learning,” Adv. Photonics 1(02), 025001 (2019). [CrossRef]  

16. A.-H. Nguyen, O. Rees, and Z. Wang, “Learning-based 3D imaging from single structured-light image,” Graph. Model. 126, 101171 (2023). [CrossRef]  

17. S. Van der Jeught and J. J. Dirckx, “Deep neural networks for single shot structured light profilometry,” Opt. Express 27(12), 17091–17101 (2019). [CrossRef]  

18. X. Cheng, Y. Tang, K. Yang, et al., “Single-exposure height-recovery structured illumination microscopy based on deep learning,” Opt. Lett. 47(15), 3832–3835 (2022). [CrossRef]  

19. F. Wang, C. Wang, and Q. Guan, “Single-shot fringe projection profilometry based on deep learning and computer graphics,” Opt. Express 29(6), 8024–8040 (2021). [CrossRef]  

20. X. Zhu, Z. Han, L. Song, et al., “Wavelet based deep learning for depth estimation from single fringe pattern of fringe projection profilometry,” Optoelectron Lett. 18(11), 699–704 (2022). [CrossRef]  

21. L. Zhang, Q. Chen, C. Zuo, et al., “Real-time high dynamic range 3D measurement using fringe projection,” Opt. Express 28(17), 24363–24378 (2020). [CrossRef]  

22. C. Zuo, L. Huang, M. Zhang, et al., “Temporal phase unwrapping algorithms for fringe projection profilometry: a comparative review,” Opt. Lasers Eng. 85, 84–103 (2016). [CrossRef]  

23. S. Feng, C. Zuo, L. Zhang, et al., “Calibration of fringe projection profilometry: A comparative review,” Opt. Lasers Eng. 143, 106622 (2021). [CrossRef]  

24. Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, et al., “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, (Springer, 2018), pp. 3–11.

25. C. Szegedy, W. Liu, Y. Jia, et al., “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2015), pp. 1–9.

26. Z.-Q. J. Xu, Y. Zhang, et al., “Training behavior of deep neural network in frequency domain,” in Neural Information Processing: 26th International Conference, ICONIP 2019, Sydney, NSW, Australia, December 12–15, 2019, Proceedings, Part I 26, (Springer, 2019), pp. 264–274.

27. D. Yin, R. Gontijo Lopes, J. Shlens, et al., “A fourier perspective on model robustness in computer vision,” Advances in Neural Information Processing Systems 32, 1 (2019).

28. Z. Wang, A. C. Bovik, H. R. Sheikh, et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]  

29. A.-H. Nguyen, O. Rees, and Z. Wang, “Single-input dual-output 3D shape reconstruction,” figshare (2023), https://figshare.com/s/c09f17ba357d040331e4

30. O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, (Springer, 2015), pp. 234–241.

31. H. Nguyen, K. L. Ly, T. Tran, et al., “hNet: single-shot 3D shape reconstruction using structured light and h-shaped global guidance network,” Results in Optics 4, 100104 (2021). [CrossRef]  
