
Single-shot fringe projection profilometry based on deep learning and computer graphics

Open Access

Abstract

Multiple works have applied deep learning to fringe projection profilometry (FPP) in recent years. However, obtaining a large amount of training data from actual systems remains a difficult problem, and the design and optimization of the networks are still worth exploring. In this paper, we introduce graphics software to build virtual FPP systems so that the desired datasets can be generated conveniently and simply. The way of constructing a virtual FPP system is first described in detail, and then some key factors that bring the virtual FPP system much closer to reality are analyzed. Aiming to accurately estimate the depth image from only one fringe image, we also design a new loss function that enhances the overall quality and restores the detailed information. Two representative networks, U-Net and pix2pix, are compared in multiple aspects. Real experiments prove the good accuracy and generalization of the network trained with the diverse data from our virtual systems and with the designed loss, providing good guidance for real applications of deep learning methods.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Fringe projection profilometry (FPP) is a classic solution for 3D shape scanning. It projects coded fringes onto an object and captures the deformed fringe images modulated by the object’s surface. The 3D shape is then reconstructed by demodulating the fringe signals, and the 3D point cloud is obtained through further calibration algorithms. Although FPP has been applied to multiple scenarios [1–4], it still faces the difficulty of balancing accuracy and speed. The N-step phase-shifting algorithm [5] is precise and commonly used, but it requires projecting and capturing at least three different fringe images, which is time-consuming for dynamic measurements. With a single-shot fringe image, Fourier-transform profilometry (FTP) [6] extracts the carrier spectrum for 3D shape reconstruction. Unfortunately, its accuracy is affected by spectral overlap when processing complex shapes. To solve this problem, spectrum-analysis methods such as the windowed Fourier transform [7] and the wavelet transform [8] have been introduced, but they require heavy computation and preset parameters. In addition, the methods above usually obtain a phase wrapped into a range of 2π, so unwrapping algorithms are needed to further restore the true 3D shapes. Temporal phase unwrapping methods, such as the Gray-code method [9,10] and the multi-frequency method [11,12], are simple to compute, but they require projecting additional fringe images. Spatial phase unwrapping methods, such as the branch-cut method [13], the flood method [14], and the Laplacian operator method [15], can perform unwrapping from a single image, but they need massive computation and their accuracy is sensitive to noise, shadows, or height jumps.

In recent years, deep learning has shown powerful performance with the improvement of neural network structures and computing power. Plenty of studies have proved that deep learning is superior to traditional algorithms in terms of speed and robustness, and it has been used for fringe denoising [16–18], fringe analysis [19,20], and phase unwrapping [21–23]. However, these works conduct experiments with limited datasets and only focus on a single step of the FPP pipeline, which means that multiple networks must be integrated to construct a complete system. The training process and the preparation of training sets are then troublesome for the integrated networks, and such integration inevitably accumulates errors. More recently, some studies directly map a single fringe image to its height/depth image with a single network [24–27]. Furthermore, some explorations into simulating training samples have been conducted to improve the performance of the trained model. In [24], a large number of pairs of fringe patterns and height maps are simulated with mathematical expressions; obviously, this approach has difficulty generating data close to reality. Naturally, simulating an FPP system becomes an optimal solution for generating data conveniently [25,26]. In [26], a digital twin of a real FPP system is even proposed innovatively. It strictly copies the calibration parameters of a real system to build a virtual system for rendering the dataset, which enables the trained network model to be used in this specific real system. However, many other factors in reality cannot be copied accurately, such as the measurement environment or the properties of the objects, which also affect the accuracy of the model.

The selection of a suitable network and its parameters is another major issue for deep learning methods. Existing works on FPP basically choose convolutional neural networks (CNNs). For instance, an optical fringe pattern de-noising convolutional neural network (FPD-CNN) model is proposed in [18]; a CNN model for wrapped-phase calculation is proposed in [20]; and the U-Net [28] is improved for phase unwrapping in [23]. To guide the selection of a suitable network for single-shot FPP, a comparison of three CNNs is conducted in [27], including fully convolutional networks (FCN), autoencoder networks (AEN), and U-Net, and U-Net is concluded to perform the best owing to its symmetric structure and feature-map concatenation. In fact, besides CNNs, other network models with powerful performance have also emerged recently, such as the generative adversarial network (GAN) [29], which generates data in a certain style through an adversarial procedure. Among GANs, pix2pix [30] establishes the conversion from one image to another and shows excellent performance in generating images with details.

With in-depth research, we realized that the simulation of FPP systems is significant and that simulating diverse interference factors from reality in the virtual FPP system is necessary. Therefore, in this paper, the methods to construct the virtual system are given in detail, and furthermore, the variable factors interfering with the FPP system in reality and the way of mapping these factors into the virtual systems are researched thoroughly. With different combinations of these factors, our virtual FPP system renders different large datasets, whose influence on the accuracy and generalization of the network is investigated and evaluated. In addition, a new loss function is designed, which considers the structural similarity of objects and the detail information to improve the overall and detailed accuracy of the result. U-Net and pix2pix, the representatives of CNNs and GANs respectively, are compared through multiple experiments to explore the better solution for estimating the depth image. The real experiment further verifies the accuracy and the generalization ability of our method.

2. Construction of a virtual FPP system and the rendering of datasets

Sufficient training data are the guarantee of excellent performance for deep learning networks. Recently, computer graphics has been successfully introduced for dataset generation [31–33]. For the FPP technique, graphics software has even been applied to simulate a system and establish diverse datasets conveniently [25,26]. This section introduces the details of constructing a virtual FPP system and rendering data samples.

2.1 Selection of 3D models

The virtual objects used in the virtual FPP system can be selected from existing 3D model datasets, such as ModelNet [34], ShapeNet [35], ABC [36], Thingi10K [37], etc. Considering the effective working distance of FPP in visible light (within 1–2 m), we select the Thingi10K dataset, which contains various 3D models of common objects, such as sculptures, vases, and dolls, as shown in Fig. 1. The variety and quantity of these models help to generate large-scale and diverse data samples as needed.

Fig. 1. Some models from Thingi10K.

2.2 Construction of a virtual FPP system

Computer graphics is good at presenting real-world scenes in a virtual form. Among various graphics software, Blender is an open-source 3D creation suite that is powerful and can render images in batch via Python scripts. In Blender, a virtual camera and a virtual projector can be placed in the “Layout”, as shown in Figs. 2(a) and 2(b). The virtual system works the same as a real FPP system, i.e., the projector projects sinusoidal fringes onto an object, and the deformed fringes are captured by a camera. Blender renders fringe images by setting the compositing node “Render Layers” to “Image”, and renders depth images by setting it to “Depth” [shown in Fig. 2(c)].
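For reference, the compositing setup of Fig. 2(c) can also be created by script. The following is a minimal bpy sketch under the assumption of Blender 2.8+ with Cycles; pass and socket names may differ slightly between versions, and the output folder is a placeholder.

```python
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'                 # physically-based path tracer

# enable the Z/Depth pass for the active view layer
bpy.context.view_layer.use_pass_z = True

# build the compositing node tree: Render Layers -> (Image, Depth)
scene.use_nodes = True
tree = scene.node_tree
tree.nodes.clear()
render_layers = tree.nodes.new('CompositorNodeRLayers')

# the fringe image goes to the regular Composite output (any common image format)
composite = tree.nodes.new('CompositorNodeComposite')
tree.links.new(render_layers.outputs['Image'], composite.inputs['Image'])

# the depth is written by a File Output node in OpenEXR to keep the raw float values
depth_out = tree.nodes.new('CompositorNodeOutputFile')
depth_out.base_path = '//depth/'               # placeholder output folder
depth_out.format.file_format = 'OPEN_EXR'
tree.links.new(render_layers.outputs['Depth'], depth_out.inputs[0])

bpy.ops.render.render(write_still=True)
```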

Fig. 2. (a) Aerial view of the scene layout in Blender; (b) side view of the scene layout in Blender; (c) the compositing node tree of this Blender system.

Some elements of our virtual FPP system in Blender include:

  • 1) Camera: the type is set to “perspective”, and its position and rotation angle can be adjusted;
  • 2) Projector: it is modeled as a point light source, and its shading node tree is designed as shown in Fig. 3 to project parallel sinusoidal fringes, where each node is explained in Appendix A;
  • 3) Objects: 3D models are loaded and scaled to a proper size;
  • 4) Background: some indoor environment maps can be imported into the “World” setting (the effect is shown in Fig. 4), and the shading node tree of “World” is set as shown in Fig. 5, where the rotation angle and the brightness of the background can be randomly changed;
  • 5) Rendering: the rendering engine is set to the physically-based path tracer “Cycles”, and the sampling integrator is set to “Branched path tracing”;
  • 6) File format: any common image format is permitted for fringe images, but the depth images should be saved in OpenEXR format to retain the original depth information.

Fig. 3. The shading node tree of constructing a projector.

Fig. 4. (a) An HDRI environment map [38]; (b) the side view of the scene layout after importing an environment map; (c) the camera view after importing an environment map.

Fig. 5. The shading node tree of “World” [39].

2.3 Factors enhancing the reality of the virtual FPP system

To enhance the generalization of our network, not only should the 3D models used to construct the training set be rich and diverse, but the settings of the virtual system also have to be adjusted to match various possible measurement environments. The input of our network is a sinusoidal fringe image with the usual mathematical description

$$I(x,y) = a(x,y)\cos [2\pi fx + \varphi (x,y)] + b(x,y) + n(x,y),$$
where a(x, y) is the amplitude intensity, f is the frequency deciding the fringe period, φ(x, y) is the phase describing the shape of an object, b(x, y) is the background, and n(x, y) is the noise. These parameters are changed in different measurements, leading to the change of fringe images and the consequent change of output depth images. Thus the main factors influencing these parameters in practice should be taken into consideration in the virtual system settings, and they are analyzed as follows.
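As a concrete illustration of Eq. (1), the short NumPy sketch below synthesizes a fringe image from a given phase map; all numeric values here are arbitrary examples rather than the settings used in this paper.

```python
import numpy as np

def synthesize_fringe(phi, a=100.0, b=120.0, f=1/20, noise_std=2.0):
    """I(x, y) = a*cos(2*pi*f*x + phi(x, y)) + b + n(x, y) for a phase map phi (H x W)."""
    h, w = phi.shape
    x = np.tile(np.arange(w), (h, 1))                       # pixel x-coordinates
    fringe = a * np.cos(2 * np.pi * f * x + phi) + b        # carrier modulated by the phase
    fringe += np.random.normal(0.0, noise_std, phi.shape)   # additive noise n(x, y)
    return np.clip(fringe, 0, 255).astype(np.uint8)

# example: a smooth bump in the phase produces circularly bent fringes
yy, xx = np.mgrid[-1:1:512j, -1:1:512j]
img = synthesize_fringe(8.0 * np.exp(-4.0 * (xx ** 2 + yy ** 2)))
```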

2.3.1 Period of fringes

According to the classical calibration theory [40], the image coordinate [x, y] is described as

$$\left\{ {\begin{array}{c} {x = {f_x}\frac{X}{Z} + {c_x}}\\ {y = {f_y}\frac{Y}{Z} + {c_y}} \end{array}} \right.,$$
where [fx, fy] are the focal lengths of the camera (or projector), [X, Y, Z] denote the camera (or projector) coordinates, and [cx, cy] are the optical centers of the camera (or projector).

In practice, the optical centers and focal lengths vary between devices. A difference in optical centers leads to a different imaging location of the object in an image, which does not cause errors for the task of this paper. However, a change of focal length causes the fringe period of the captured fringe image to be zoomed, which corresponds to a change of f or φ(x, y) in Eq. (1). This type of change influences the depth extraction and therefore must be considered. In the virtual FPP system, it can be simulated by setting various periods of the projected fringes (adjusting the “scale-X” in the 2nd “mapping” node in Fig. 3).

2.3.2 Pose between the camera and the projector

The space geometry relation between the camera coordinate [Xc, Yc, Zc] and the projector coordinate [Xp, Yp, Zp] can be described as

$$\left[ {\begin{array}{c} {{X_c}}\\ {{Y_c}}\\ {{Z_c}} \end{array}} \right] = R\left[ {\begin{array}{c} {{X_p}}\\ {{Y_p}}\\ {{Z_p}} \end{array}} \right] + t,$$
where R and t denote the rotation and translation matrices, respectively. To simulate this relationship, we rotate the projected fringes around the optical axis of the projector (by adjusting the “rotation-Z” in the 2nd “mapping” node in Fig. 3) and set different angles between the optical axes of the camera and the projector to simulate different R and t.

2.3.3 Amplitude intensity and background

The amplitude intensity of a fringe image, corresponding to a(x, y) in Eq. (1), is generally decided by the material/texture of the objects, the power of the projector, and the brightness of the background, which can be set conveniently in the virtual FPP system (by adjusting the “strength” in the “background” node in Fig. 5). The environment map can also be shifted or rotated multiple times to simulate objects located in different backgrounds (by adjusting the “rotation-Z” in the “mapping” node in Fig. 5).

With the factors above adjusted, the virtual FPP system can generate data much closer to those from a practical system, which thus helps to improve the practicability of the trained network.
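To give an idea of how these factors can be randomized in batch, the bpy sketch below adjusts the relevant node inputs by script. The node and socket names follow Figs. 3 and 5 but are assumptions that depend on the Blender version and on how the node trees were named, and the numeric ranges are placeholders rather than the values in Table 1.

```python
import random
from math import radians
import bpy

# fringe period and fringe rotation: the 2nd "Mapping" node of the projector node tree (Fig. 3)
proj_nodes = bpy.data.lights['Projector'].node_tree.nodes     # 'Projector' is a hypothetical light name
mapping = proj_nodes['Mapping.001']                           # the second Mapping node
mapping.inputs['Scale'].default_value[0] = random.uniform(0.8, 1.2)            # "scale-X": fringe period
mapping.inputs['Rotation'].default_value[2] = radians(random.uniform(-5, 5))   # "rotation-Z": fringe rotation

# background brightness and rotation: the "World" node tree (Fig. 5)
world_nodes = bpy.context.scene.world.node_tree.nodes
world_nodes['Background'].inputs['Strength'].default_value = random.uniform(0.1, 1.0)
world_nodes['Mapping'].inputs['Rotation'].default_value[2] = radians(random.uniform(0, 360))

# relative pose: additionally rotate the projector object around its optical axis
bpy.data.objects['Projector'].rotation_euler[2] += radians(random.uniform(-3, 3))
```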

3. Networks and the designed loss function

U-Net has been proved the best among several CNNs used for FPP techniques [27]. In recent years, GANs have revealed powerful ability in image generation, so this paper explores whether they perform better on depth estimation. Below, the architectures of U-Net and a conditional GAN (cGAN) named pix2pix [30] are briefly introduced, and the design of our new loss function is also explained.

3.1 Network architecture

3.1.1 U-Net

U-Net [28] follows an encoder-decoder structure. The encoder down-samples the input images to extract features, and the decoder up-samples the feature maps to obtain a high-resolution output image. U-Net also has a special skip-connection structure so that larger-scale feature maps can be sent directly to the up-sampling process, and therefore the encoder and decoder share low-level information. With these structures, U-Net learns from less data but achieves higher precision. As U-Net is a part of pix2pix, its structure is shown together with pix2pix in Fig. 7.
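A minimal PyTorch sketch of such an encoder-decoder with skip connections is given below. The channel widths, normalization layers, and the tanh output (matching the [-1, 1] normalization in section 4.1) are our assumptions rather than the exact configuration in Fig. 7.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions with batch normalization and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class UNet(nn.Module):
    """Fringe image (B, 1, H, W) -> depth image (B, 1, H, W); H and W must be divisible by 16."""
    def __init__(self, in_ch=1, out_ch=1, feats=(64, 128, 256, 512)):
        super().__init__()
        self.downs, self.ups = nn.ModuleList(), nn.ModuleList()
        self.pool = nn.MaxPool2d(2)
        ch = in_ch
        for f in feats:                                   # encoder
            self.downs.append(DoubleConv(ch, f))
            ch = f
        self.bottleneck = DoubleConv(feats[-1], feats[-1] * 2)
        for f in reversed(feats):                         # decoder
            self.ups.append(nn.ConvTranspose2d(f * 2, f, 2, stride=2))
            self.ups.append(DoubleConv(f * 2, f))
        self.head = nn.Conv2d(feats[0], out_ch, 1)

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)                               # keep feature maps for skip connections
            x = self.pool(x)
        x = self.bottleneck(x)
        for i in range(0, len(self.ups), 2):
            x = self.ups[i](x)                            # up-sample
            x = torch.cat([skips[-(i // 2 + 1)], x], 1)   # concatenate the matching encoder features
            x = self.ups[i + 1](x)
        return torch.tanh(self.head(x))                   # output scaled to [-1, 1]
```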

3.1.2 pix2pix

pix2pix [30] contains a generator and a discriminator. The generator produces fake images, and the discriminator tries to identify the fake ones, guiding the generator to produce fake images much closer to the target output. Figure 6 presents the architecture of pix2pix. The network structure of pix2pix is shown in Fig. 7, where the generator has a U-Net shape and the discriminator is a “PatchGAN”, i.e., a multi-layer CNN.
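For reference, a compact PatchGAN-style discriminator conditioned on the fringe image could be sketched as below; the layer widths follow the common 70×70 PatchGAN configuration and are assumptions, not the exact network in Fig. 7.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Scores (fringe, depth) pairs patch-wise; the input is their channel-wise concatenation."""
    def __init__(self, in_ch=2, base=64):
        super().__init__()
        def block(i, o, stride, norm=True):
            layers = [nn.Conv2d(i, o, 4, stride, 1)]
            if norm:
                layers.append(nn.BatchNorm2d(o))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers
        self.net = nn.Sequential(
            *block(in_ch, base, 2, norm=False),
            *block(base, base * 2, 2),
            *block(base * 2, base * 4, 2),
            *block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, 1, 1),   # one realism score per image patch
        )
    def forward(self, fringe, depth):
        return self.net(torch.cat([fringe, depth], dim=1))
```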

Fig. 6. The architecture of pix2pix.

Fig. 7. The network structure of pix2pix, including the structure of U-Net.

3.2 Proposed new loss function

A loss function defines the convergence behavior of a network and is therefore the key to the quality of the outputs. The mean absolute error (L1 loss) and the mean square error (L2 loss) are the most commonly used loss functions. However, either of the two only evaluates average errors; thus, the resulting outputs may show low quality in some local regions.

The task in this paper is to retrieve a depth image that records the 3D shape of a scanned object; hence the geometry and the spatial structure are a good constraint to preserve the overall quality of the outputs. The structural similarity (SSIM) index [41] leverages structural information to evaluate image quality, and it is defined as

$${\textrm{SSIM}} ({u,v} )= \frac{{({2{\mu_u}{\mu_v} + {c_1}} )({2{\sigma_{uv}} + {c_2}} )}}{{({{\mu_u}^2 + {\mu_v}^2 + {c_1}} )({{\sigma_u}^2 + {\sigma_v}^2 + {c_2}} )}},$$
where μu is the mean of the evaluated image u, μv is the mean of the ground truth v, σu and σv are the standard deviations of u and v, respectively, σuv is the covariance of u and v, and c1 and c2 are two constants to avoid division by zero. The SSIM ranges in [0, 1], and it scores low if the evaluated image is compressed, blurred, or contaminated by noise. Given this ability to measure the overall structure of 3D shapes, we take the SSIM index as one term of the loss function. This term is described as
$${L_{T1}} = 1 - {\textrm{SSIM}} ({\rm G}(I),d),$$
where I is an input fringe image, G(I) is the fake depth image generated by U-Net or pix2pix’s generator, and d is the ground truth of G(I).

With the overall accuracy ensured, the local detail is another essential factor to the restoration of a 3D shape. As details are always embedded in the edges of an image, we add another term involving a common tool for edge detection, the Laplacian operator, to the loss function to estimate the detail’s similarity between G(I) and d. The added term is described as

$${L_{T2}} = {||{{\rm{La}} ({\rm G} (I)) - {\rm{La}} (d)} ||_1},$$
where La($\cdot$) denotes convolving an image with the Laplacian operator. With this term added, the network is more sensitive to slight variations of depth, so not only is the accuracy of details improved, but the abrupt artifacts in the generated depth images are also suppressed.

Based on the above analysis, for the U-Net, we replace the commonly used L1 loss or L2 loss with our new loss function below:

$${L_{U - Net}} = {\lambda _1}{L_{T1}} + {\lambda _2}{L_{T2}},$$
where LT1 and LT2 have been given in Eqs. (5) and (6), respectively, and λ1 and λ2 are the adjustable weights.
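The loss of Eq. (7) is straightforward to implement. The sketch below is a minimal PyTorch version for single-channel depth images normalized to [-1, 1] (so the dynamic range is 2); the uniform 8×8 SSIM window and the standard constants k1 = 0.01, k2 = 0.03 are assumptions consistent with Eq. (4).

```python
import torch
import torch.nn.functional as F

def ssim_index(u, v, win=8, dynamic_range=2.0):
    """Mean SSIM of Eq. (4) computed with a uniform win x win window."""
    c1, c2 = (0.01 * dynamic_range) ** 2, (0.03 * dynamic_range) ** 2
    k = torch.ones(1, 1, win, win, device=u.device) / win ** 2
    mu_u, mu_v = F.conv2d(u, k), F.conv2d(v, k)
    var_u = F.conv2d(u * u, k) - mu_u ** 2
    var_v = F.conv2d(v * v, k) - mu_v ** 2
    cov = F.conv2d(u * v, k) - mu_u * mu_v
    s = ((2 * mu_u * mu_v + c1) * (2 * cov + c2)) / \
        ((mu_u ** 2 + mu_v ** 2 + c1) * (var_u + var_v + c2))
    return s.mean()

def laplacian(img):
    """La(.) in Eq. (6): convolution with the 3x3 Laplacian kernel."""
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                     device=img.device).view(1, 1, 3, 3)
    return F.conv2d(img, k, padding=1)

def unet_loss(pred, target, lam1=100.0, lam2=10.0):
    l_t1 = 1.0 - ssim_index(pred, target)                   # Eq. (5)
    l_t2 = F.l1_loss(laplacian(pred), laplacian(target))    # Eq. (6)
    return lam1 * l_t1 + lam2 * l_t2                        # Eq. (7)
```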

For the pix2pix, we define the loss function as

$${L_{pix2pix}} = {L_{cGAN}} + {\lambda _1}{L_{T1}} + {\lambda _2}{L_{T2}}. $$

In Eq. (8), the last two terms are the same as those in Eq. (7), and LcGAN is a term unique to the cGAN that assesses the accuracy of the discriminator’s output by

$${L_{cGAN}} = \frac{1}{2}{||{{\rm D} (I,{\rm G}(I))} ||_2} + \frac{1}{2}{||{1 - {\rm D} (I,d)} ||_2},$$
where D(·) denotes the estimation result of the discriminator that distinguishes between the fake depth image G(I) and its ground truth d by comparing their relationships to the input fringe image I. Note that the L2 loss is exploited in Eq. (9) to replace the cross-entropy loss used in original pix2pix [30] since it shows better performance in improving the quality of the result and the stability of the training process [42].
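In least-squares form, the adversarial terms can be written as follows; this is our reading of Eq. (9) (the discriminator objective) together with the usual least-squares generator term added to Eq. (8), not a verbatim excerpt of the authors’ code. D is assumed to be a discriminator such as the PatchGAN sketch in section 3.1.2.

```python
def discriminator_loss(D, I, fake_depth, real_depth):
    # Eq. (9): fake pairs should score 0, real pairs should score 1
    fake_score = D(I, fake_depth.detach())
    real_score = D(I, real_depth)
    return 0.5 * (fake_score ** 2).mean() + 0.5 * ((1.0 - real_score) ** 2).mean()

def generator_adv_loss(D, I, fake_depth):
    # least-squares generator term: push D(I, G(I)) towards 1
    return ((1.0 - D(I, fake_depth)) ** 2).mean()
```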

4. Experiments

4.1 Dataset rendering and data preprocessing

In this paper, we choose 624 models from Thingi10K, which covers a rich variety of items with various complexities. To ensure the generalization of the trained model, we separate the 624 models into 13 groups and set different rendering parameters analyzed in section 2.3 for each group to simulate the possible various situations in practice. The variation range we set for each parameter in Blender is shown in Table 1 and the rendered fringe images varying with different parameters are displayed in Fig. 8.

Fig. 8. Fringe images rendered by adjusting the parameters in Table 1.

Table 1. The variation range for each parameter

To enrich the dataset, we render multiple images for each model. In the camera coordinate system shown in Fig. 9, the model is first rotated around the y-axis 12 times in steps of 30°, and for each of these rotations, another 12 rotations in steps of 5° are applied around the z-axis. Thus, 144 fringe images are rendered for each object, and a depth image is rendered corresponding to each fringe image. In total, 89,856 pairs of images are obtained to create the dataset. We randomly allocate the 624 models to the training and test sets in a ratio of 8.5:1.5, so there are no identical objects in the training and test sets.
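Such a rendering loop might be scripted in bpy as sketched below; the object name and output paths are hypothetical, and the rotations are applied to the object's Euler angles under the assumption that they approximate the camera-frame y/z rotations described above.

```python
import bpy
from math import radians

obj = bpy.data.objects['Model']            # hypothetical name of the imported 3D model
scene = bpy.context.scene

for i in range(12):                        # 12 steps of 30 degrees
    for j in range(12):                    # 12 steps of 5 degrees -> 144 renders per model
        obj.rotation_euler = (0.0, radians(30 * i), radians(5 * j))
        scene.render.filepath = f'//renders/fringe_{i:02d}_{j:02d}.png'
        bpy.ops.render.render(write_still=True)
```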

Fig. 9. The rotations of objects. First row: rotation around the y-axis of camera coordinate system; second row: rotation around the z-axis of camera coordinate system.

Before training, each fringe image and depth image I is normalized to [-1, 1] by

$${I_n} = \frac{{{I_{n1}} - 0.5}}{{0.5}},\qquad {I_{n1}} = \frac{{I - \min (I)}}{{\max (I) - \min (I)}}.$$
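Equation (10) is simply a min-max scaling followed by a shift to [-1, 1]; in NumPy it could read:

```python
import numpy as np

def normalize(img):
    # Eq. (10): min-max scale to [0, 1], then map to [-1, 1]
    i_n1 = (img - img.min()) / (img.max() - img.min())
    return (i_n1 - 0.5) / 0.5
```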

4.2 Comparison of different loss functions

We implement ablation experiments to compare the effect of different loss functions and their combinations based on U-Net and pix2pix. During the training process, we use the Adam optimizer with momentum parameters β1=0.5 and β2=0.999. The batch size is 4, with a learning rate of 0.0003. The size of the SSIM window is 8. All the networks with different loss functions are trained for 13 epochs, which is enough for convergence. We set λ1 and λ2 in Eqs. (7) and (8) to 100 and 10, respectively, which are the best empirical values.
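In PyTorch these settings correspond roughly to the configuration below, reusing the UNet and unet_loss sketches from section 3; the data loader is a placeholder for the rendered training pairs.

```python
import torch

model = UNet(in_ch=1, out_ch=1)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.5, 0.999))

for epoch in range(13):                         # 13 epochs suffice for convergence
    for fringe, depth in train_loader:          # placeholder loader yielding batches of 4 normalized pairs
        pred = model(fringe)
        loss = unet_loss(pred, depth, lam1=100.0, lam2=10.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```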

Figure 10 illustrates the qualitative comparison of different loss functions. All fringe images in Fig. 10 are chosen from the test set, in which the objects have not been seen by the network during training. Figures 10(a) and 10(b) represent the rendered fringe images and depth images (ground truth), respectively. Figures 10(c)–10(f) are the results of U-Net with the proposed SSIM + Laplace loss, the loss with only the SSIM term, only the L1 loss, and only the L2 loss, respectively. Similarly, Figs. 10(g)–10(j) show the corresponding results using pix2pix. For both U-Net and pix2pix, our proposed loss function performs the best in eliminating artifacts (the red boxes in the first two rows) and keeping details (the red boxes in the last two rows). For clearer display, Fig. 11 magnifies the regions in the red boxes for the U-Net results in Fig. 10.

Fig. 10. The results of ablation experiments by U-Net and pix2pix with different loss functions respectively.

Fig. 11. The amplifications for some of the local parts in Fig. 10.

To quantify the results, we further compute the mean absolute error (MAE) and the mean standard deviation of errors (MSDE) for all the results in the test sets, as listed in Table 2. The results in Table 2 basically accord with Fig. 10, and they prove that our proposed loss function is highly effective. Figure 10 and Table 2 also show that U-Net generally performs better than pix2pix both qualitatively and quantitatively.

Table 2. The quantitative metrics of different loss functions

Owing to its best accuracy, our SSIM + Laplace loss function is adopted for the following experiments.

4.3 Relationship of the generalization ability and the accuracy to the diversity level of the dataset

In this section, we explore the impact of different rendering parameters on the accuracy of the depth images generated by the network. With the models separated into 13 groups, the datasets are rendered in three different cases:

  • D1: the parameters in Table 1 are all the same for rendering all groups of images;
  • D2: the first three parameters in Table 1 are different from group to group;
  • D3: all the parameters in Table 1 vary among different groups.

Appendix B gives the other common settings of D1, D2, and D3. Each dataset above is divided into a training set and a test set by 8.5:1.5 and then U-Net and pix2pix are trained by D1, D2, and D3 separately.

Figure 12 illustrates the depth images generated from a real captured fringe image by U-Net and pix2pix trained on D1, D2, and D3, respectively, where the fringe image is captured by an arbitrary FPP system under arbitrary indoor lighting, and the object “Venus” does not appear in any training dataset. It is obvious that the results are very poor and even unusable if the factors representing the measurement environments are not considered (in D1 and D2). Therefore, the generalization ability of the network is much better when more system and environment variables are considered in rendering the dataset. Table 3 records the MAE and MSDE of the test set in each dataset. It shows that for both U-Net and pix2pix, the accuracy decreases as the complexity of the dataset increases.

Fig. 12. The results of a real fringe image by networks trained by D1, D2 and D3 respectively.

Table 3. The MAE and MSDE of the test set in each dataset

4.4 Other unique interference factors in practical use

In section 2.3, we analyzed the common factors in reality that may influence the performance of the network model, based on the components of a fringe image expressed in Eq. (1). These factors are recommended to be considered when rendering the dataset with the virtual system. In this section, we investigate other factors from the view of the measured objects. One of these factors is that objects may have various colorful surfaces and textures; the other is that objects may be placed casually so that multiple isolated objects are captured together. Examples of the two cases are displayed in Fig. 13. With these two factors, we create another two datasets:

  • D4: images rendered with colorful 3D models in ShapeNet and images in D3;
  • D5: images rendered with multi objects and images with colorful 3D models in D4.

Fig. 13. The samples with: (a)-(b) an object having a colorful surface; (c)-(d) multiple objects.

For dataset D4, 320 models in ShapeNet [35] are selected and allocated to the training set and the test set in the ratio 8.5:1.5, so the objects in the test set do not overlap with those in the training set. Then each model is rotated 144 times, and 46,080 new image pairs are rendered with different parameter settings from Table 1. These images and the images in D3 form the dataset D4.

For dataset D5, 624 objects selected from Thingi10K form 312 multi-object pairs. We allocate the 312 multi-object pairs to the training set and test set in the ratio 8.5:1.5, so that the tested objects never appear in the training dataset. Then each multi-object pair in the training set is rendered 144 times according to Table 1 and the setups of D3. The new images and the images rendered from ShapeNet in D4 form D5.

The testing results of the models trained on D3 and D4 are shown in Fig. 14, where the objects are casually taken from daily necessities. Similar to Fig. 12, the comparison in Fig. 14 illustrates that the dataset rendered by the virtual system should also take the characteristics of the real application scenarios (such as the color of the objects) into consideration. In addition, the depth images in Fig. 14 generated by pix2pix present more obvious strip-like artifacts, illustrating that U-Net still performs better than pix2pix. Furthermore, we evaluate the MAE and MSDE of the results from U-Net on D4, which are 0.0230 and 0.0663, respectively, showing that good accuracy is maintained.

Fig. 14. Output depth images resulting from different datasets.

In Fig. 15, we use a real fringe image to test the U-Net trained by D5. To quantify the overall error, we also compute the MAE and MSDE of the test set in D5, and they are 0.0153 and 0.0629, respectively. The results are satisfactory both visually and quantitatively.

Fig. 15. The testing effect with the real fringe images. (a) Fringe image; (b) The depth image generated by the network.

5. Discussion

5.1 Comparison between U-Net and pix2pix

The comparisons in sections 4.2 to 4.4 all illustrate that U-Net performs better than pix2pix. The main reason is that U-Net is adept at extracting features and then predicting an output, while pix2pix is better at synthesizing an image with reasonably complex patterns from a simple image. For the task of this paper, the depth is embedded in the fringes and there is a mapping between the fringes and the depth. Consequently, U-Net is more suitable for this task.

5.2 Generalization ability of deep learning

The purpose of simulating a virtual FPP system is to conveniently generate large-scale, diverse data close to reality. This is also the reason why we researched the factors interfering with FPP systems in real applications. Only when the data are sufficiently diverse is the generalization of the trained network good enough; otherwise, the trained model may even fail to work. However, the interference factors in reality are complex and diverse and cannot all be listed here, and greater diversity of the dataset is inevitably accompanied by lower accuracy, as evaluated in section 4.3. Therefore, the generalization ability can only be achieved relatively, which is an inherent defect of deep learning methods.

In this paper, we analyze the common factors influencing the FPP system in reality based on the mathematical expression of a fringe image, and we also list another two unique factors from the view of the objects. The common factors are recommended to be considered in most cases, while the unique factors should be analyzed and selected case by case. Users can customize the datasets according to their needs by adding some unique factors (if any) to the common ones, as in the design of D4 and D5 in section 4.4.

Furthermore, some default assumptions are made in this paper; for example, the objects in all experiments are assumed to be Lambertian, and the lens distortion is neglected or assumed to be corrected by pre-calibration.

5.3 Scale ambiguity

The scale ambiguity problem is caused by the inconsistency of the scale space for the depth value in the virtual system and the real system. As shown in Fig. 16, the point A and the point B on two objects are imaged at the same point C by the camera, but the corresponding points are “imaged” in the projector as PA and PB, respectively. Therefore, the trained network can distinguish the depths of A and B by pairing C to PA or PB in a single fringe image under a fixed system setting. However, once the system settings are changed, the points PA and PB corresponding to C will be “imaged” differently, and this is the cause of the ambiguity.

Fig. 16. An FPP system.

In order to eliminate this ambiguity, one solution is to completely copy the real system to form a virtual system, as proposed in [26], which is limited since the rendered images and the trained network model can only be applied to that specific real FPP system. One of our ongoing studies is to first render a large-scale dataset with the virtual FPP system set with varying calibration parameters, and then train a network with the fringe images and the corresponding calibration parameters as the network input. However, in this paper, this problem has not yet been solved. Thus, the virtual system has to be set with the same calibration parameters as the real one if the real size of the 3D reconstruction is needed; otherwise, we can only obtain a shape without the real scale.

6. Conclusion

Sufficient and diverse data are the guarantee of the application scope of learning-based methods. Therefore, in this paper, we build a virtual FPP system with Blender for conveniently generating data, and we analyze the key factors that can be set in the virtual FPP system to render images much closer to reality. To enhance the accuracy of the output depth image, we also propose an effective new loss function combining the SSIM index and the Laplacian operator. Abundant experiments are conducted, which prove that U-Net performs better in the task of depth image estimation and that our proposed loss function improves the overall and detailed accuracy of the result. Furthermore, the experiments investigate the relationship of the generalization ability and accuracy to the diversity level of the datasets. These works all provide a good reference for improving the deep learning methods used in FPP.

Appendix A

The following explains the shading tree nodes of the projector in Blender:

  • 1) Geometry-normal: “normal” refers to the vector pointing from the projector to a certain point on the surface of the object.
  • 2) Mapping-Point: the rotation (X/Y/Z) here decides the direction of the emitting light.
  • 3) SeparateXYZ-Divide-CombineXYZ: project the 3D vector to the xy-plane, so that the projected pattern is only related to the x and y coordinates, not to the z coordinate.
  • 4) The second Mapping-point: change the position, direction, and size of the projection pattern. Because the origin of the projection pattern is in the upper left corner, the “X” and “Y” coordinates of the “Location” are offset by 0.5 meters.
  • 5) sin0.bmp: set the projection pattern (fringe image).
  • 6) Light Falloff: set the way the light intensity decreases with distance.
    • a) Strength: light intensity before applying attenuation (Light Falloff node).
    • b) Constant: set a constant light attenuation.
  • 7) Emission: add Lambertian luminous shader for light output.
  • 8) Light Output: light output.

Appendix B

Common settings:

Camera mode: Perspective

Camera field of view: 7°

Projector size: 0.001m

Position of the background wall when rendering depth image: (0, 0.05m, 0)

Position of the 3D model: (0, 0, -0.02m)

Position of the projector: (0, -1.5m, 0)

Size of the 3D model: the maximal dimension is scaled to 0.14m.

The distance from the camera to the object: 1.55m

The intersection of the optical axis of the camera and the optical axis of the projector: (0, 0, 0)

Background of the fringe images in D1 and D2: all white

Note: When rendering depth images, keep the positions of the camera and the object unchanged, and import a plane behind the object, otherwise the depth image would record the depth of the background regions as infinite.

Funding

National Natural Science Foundation of China (61828501); Basic Research Program of Jiangsu Province (BK20192004C); Natural Science Foundation of Jiangsu Province (BK20181269).

Disclosures

The authors declare no conflicts of interest.

References

1. F. Tsalakanidou, F. Forster, S. Malassiotis, and M. G. Strintzis, “Real-time acquisition of depth and color images using structured light and its application to 3D face recognition,” RTI 11(5-6), 358–369 (2005). [CrossRef]  

2. J. I. Laughner, S. Zhang, H. Li, C. C. Shao, and I. R. Efimov, “Mapping cardiac surface mechanics with structured light imaging,” Am. J. Physiol. Heart Circ. Physiol. 303(6), H712–H720 (2012). [CrossRef]  

3. J. Xu, P. Wang, Y. Yao, S. Liu, and G. Zhang, “3D multi-directional sensor with pyramid mirror and structured light,” Opt. Lasers Eng. 93, 156–163 (2017). [CrossRef]  

4. J. Burke, T. Bothe, W. Osten, and C. F. Hess, “Reverse engineering by fringe projection,” Proc. SPIE 4778, 312 (2002). [CrossRef]  

5. V. Srinivasan, H. C. Liu, and M. Halioua, “Automated phase-measuring profilometry of 3-D diffuse objects,” Appl. Opt. 23(18), 3105–3108 (1984). [CrossRef]  

6. M. Takeda and K. Mutoh, “Fourier transform profilometry for the automatic measurement of 3-D object shapes,” Appl. Opt. 22(24), 3977–3982 (1983). [CrossRef]  

7. K. Qian, “Two-dimensional windowed Fourier transform for fringe pattern analysis: Principles, applications and implementations,” Opt. Lasers Eng. 45(2), 304–317 (2007). [CrossRef]  

8. J. Zhong and J. Weng, “Spatial carrier-fringe pattern analysis by means of wavelet transform: wavelet transform profilometry,” Appl. Opt. 43(26), 4993–4998 (2004). [CrossRef]  

9. G. Sansoni, S. Corini, S. Lazzari, R. Rodella, and F. Docchio, “Three-dimensional imaging based on Gray-code light projection: characterization of the measuring algorithm and development of a measuring system for industrial applications,” Appl. Opt. 36(19), 4463–4472 (1997). [CrossRef]  

10. D. Zheng, Q. Kemao, F. Da, and H. Seah, “Ternary Gray code-based phase unwrapping for 3D measurement using binary patterns with projector defocusing,” Appl. Opt. 56(13), 3660–3665 (2017). [CrossRef]  

11. J. M. Huntley and H. Saldner, “Temporal phase-unwrapping algorithm for automated interferogram analysis,” Appl. Opt. 32(17), 3047–3052 (1993). [CrossRef]  

12. M. Zhang, Q. Chen, T. Tao, S. Feng, Y. Hu, H. Li, and C. Zuo, “Robust and efficient multi-frequency temporal phase unwrapping: optimal fringe frequency and pattern sequence selection,” Opt. Express 25(17), 20381–20400 (2017). [CrossRef]  

13. R. M. Goldstein, H. A. Zebker, and C. L. Werner, “Satellite radar interferometry: Two-dimensional phase unwrapping,” Radio Sci. 23(4), 713–720 (1988). [CrossRef]  

14. S. Zhang and S. Yau, “High-resolution, real-time 3D absolute coordinate measurement based on a phase-shifting method,” Opt. Express 14(7), 2644–2649 (2006). [CrossRef]  

15. M. A. Schofield and Y. Zhu, “Fast phase unwrapping algorithm for interferometric applications,” Opt. Lett. 28(14), 1194–1196 (2003). [CrossRef]  

16. K. Yan, Y. Yu, C. Huang, L. Sui, K. Qian, and A. Asundi, “Fringe pattern denoising based on deep learning,” Opt. Commun. 437, 148–152 (2019). [CrossRef]  

17. F. Hao, C. Tang, M. Xu, and Z. Lei, “Batch denoising of ESPI fringe patterns based on convolutional neural network,” Appl. Opt. 58(13), 3338–3346 (2019). [CrossRef]  

18. B. Lin, S. Fu, C. Zhang, F. Wang, and Y. Li, “Optical fringe patterns filtering based on multi-stage convolution neural network,” Opt. Lasers Eng. 126, 105853 (2020). [CrossRef]  

19. S. Feng, Q. Chen, G. Gu, T. Tao, L. Zhang, Y. Hu, W. Yin, and C. Zuo, “Fringe pattern analysis using deep learning,” Adv. Photonics 1(02), 1 (2019). [CrossRef]  

20. S. Feng, C. Zuo, W. Yin, G. Gu, and Q. Chen, “Micro deep learning profilometry for high-speed 3D surface imaging,” Opt. Lasers Eng. 121, 416–427 (2019). [CrossRef]  

21. J. Zhang, X. Tian, J. Shao, H. Luo, and R. Liang, “Phase unwrapping in optical metrology via denoised and convolutional segmentation networks,” Opt. Express 27(10), 14903–14912 (2019). [CrossRef]  

22. T. Zhang, S. Jiang, Z. Zhao, K. Dixit, X. Zhou, J. Hou, Y. Zhang, and C. Yan, “Rapid and robust two-dimensional phase unwrapping via deep learning,” Opt. Express 27(16), 23173–23185 (2019). [CrossRef]  

23. K. Wang, Y. Li, Q. Kemao, J. Di, and J. Zhao, “One-step robust deep learning phase unwrapping,” Opt. Express 27(10), 15100–15115 (2019). [CrossRef]  

24. S. Van der Jeught and J. Dirckx, “Deep neural networks for single shot structured light profilometry,” Opt. Express 27(12), 17091–17101 (2019). [CrossRef]  

25. C. Wang, Q. Guan, and F. Wang, “Single stripe projection measurement method based on graphics and deep learning,” Chinese Invention Patent 201911260063 (10 Dec 2019).

26. Y. Zheng, S. Wang, Q. Li, and B. Li, “Fringe projection profilometry by conducting deep learning from its digital twin,” Opt. Express 28(24), 36568–36583 (2020). [CrossRef]  

27. H. Nguyen, Y. Wang, and Z. Wang, “Single-Shot 3D Shape Reconstruction Using Structured Light and Deep Convolutional Neural Networks,” Sensors 20(13), 3718 (2020). [CrossRef]  

28. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention (MICCAI) (2015), pp. 234–241.

29. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in 27th International Conference on Neural Information Processing Systems (NIPS) (2014), pp. 2672–2680.

30. P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition. (CVPR) (2017), pp. 1125–1134.

31. F. Gomez-Donoso, A. Garcia-Garcia, J. Garcia-Rodriguez, S. Orts-Escolano, and M. Cazorla, “LonchaNet: A sliced-based CNN architecture for real-time 3D object recognition,” in 2017 International Joint Conference on Neural Networks (IJCNN) (2017), pp. 412–418.

32. Y. Li, A. Dai, L. Guibas, and M. Nießner, “Database-assisted object retrieval for real-time 3d reconstruction,” Comput. Graph. Forum 34(2), 435–446 (2015). [CrossRef]  

33. P. Stavroulakis, S. Chen, C. Delorme, P. Bointon, G. Tzimiropoulos, and R. Leach, “Rapid tracking of extrinsic projector parameters in fringe projection using machine learning,” Opt. Lasers Eng. 114, 7–14 (2019). [CrossRef]  

34. Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp. 1912–1920.

35. A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “ShapeNet: An information-rich 3d model repository,” arXiv: 1512.03012v1 (2015).

36. S. Koch, A. Matveev, Z. Jiang, F. Williams, A. Artemov, E. Burnaev, M. Alexa, D. Zorin, and D. Panozzo, “ABC: A big CAD model dataset for geometric deep learning,” in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pp. 9601–9611.

37. Q. Zhou and A. Jacobson, “Thingi10K: A dataset of 10,000 3D-printing models,” arXiv: 1605.04797v2 (2016).

38. O. Yakovlyev, “Artist Workshop,” HDRI Haven, https://hdrihaven.com/hdri/?c=indoor&h=artist_workshop.

39. J. Versluis, “How to rotate a HDRI in Blender,” https://www.versluis.com/2020/07/rotate-hdri-in-blender/.

40. Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000). [CrossRef]  

41. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. 13(4), 600–612 (2004). [CrossRef]  

42. X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “Least Squares Generative Adversarial Networks,” in 2017 IEEE International Conference on Computer Vision (ICCV) (2017), pp. 2813–2821.
