SE-FSCNet: full-scale connection network for single-shot phase demodulation

Zeyu Song; Junpeng Xue; Wenbo Lu; Ran Jia; Zhichao Xu; Changzhi Yu

doi:10.1364/OE.520818

1. Introduction

Fringe projection profilometry [1] is a very popular non-contact measurement technology, which has wide applications in intelligent manufacturing, reverse engineering, medical treatment and other fields [2,3]. This technology projects fringe gratings onto the surface of the tested object actively and modulates the three-dimensional information of the surface into the phase of fringe patterns. When calculating three-dimensional geometric morphology information through modulated fringe patterns, phase demodulation and phase unwrapping technologies are important steps. The accuracy of wrapped phase largely determines the three-dimensional measurement accuracy of fringe projection profilometry.

In the past few decades, numerous methods on phase demodulation have been proposed, which can be roughly divided into two categories: (1)spatial phase demodulation methods; (2)phase-shifting method. The Fourier transform method(FT) [4] is the most commonly used method in spatial phase demodulation. In 1983 Takeda et al. first used this method to achieve phase extraction and 3D reconstruction of single-shot fringe patterns. Unlike the pixel by pixel calculation of phase-shifting method, FT can simultaneously process all pixels of one fringe pattern. It has high motion robustness, but the mutual influence between pixels limits the accuracy of this method [5]. Therefore, developed on the basis of FT, multiple studies on block-by-block processing fringe patterns have been proposed, aiming to achieve unity in pixel-level processing and global processing. Typical effective methods include windowed Fourier transform and wavelet transform [6,7]. Phase-shifting method projects multiple fringe pattern sequences with different initial phases(usually requiring at least three) to demodulate the phase information of the object surface [8]. This method was first introduced in fringe projection profilometry by Srinivasan et al. [9] in 1984. Compared with spatial phase demodulation methods, phase-shifting method can achieve higher accuracy and resolution in pixel-level phase measurement [10]. It also has high robustness against uneven background intensity and surface modulation [11]. However, the working characteristics of projecting multiple fringe patterns greatly reduce the efficiency of this method, and external interference and vibration will introduce unacceptable errors. In summary, both traditional methods mentioned above cannot balance the accuracy and efficiency of phase demodulation. The former is difficult to handle objects with large morphological gradients or complex details, while the latter is not suitable for the dynamic measurement applications.

In recent years, with the rapid development of deep learning theory and computer hardware, deep learning technology has shown outstanding performance in many scientific and engineering fields. In the field of computer vision, methods based on convolutional neural networks and Transformer network have shone through. Numerous scholars have conducted detailed organization and reviews of the application of deep learning in phase measurement [12–14]. They demonstrate that in most cases deep learning methods can achieve better performance than traditional technologies in various stages of fringe projection measurement, including preprocessing, phase analysis and post-processing. A small portion of researches use deep learning for data processing, such as using deep learning methods for denoising in preprocessing [15,16] and error correction in post-processing [17,18], to improve the input or output data of traditional methods. More researches focus on the most important aspect—phase analysis. Due to the end-to-end nature of deep learning, various methods have been proposed to explore the cross-stage acquisition of target physical quantities, like from fringe patterns to unwrapped phase and from fringe patterns to depth maps [19,20]. Li et al. [21] used neural network to recover unwrapped phase from single-shot spatially multiplexing fringe pattern which contains different frequency fringes and avoided the spectrum-aliasing problem of traditional spatial-multiplexing methods. However, due to different reflectivity in the intensity of the fringe pattern, it is difficult for end-to-end deep learning network to accurately establish high-quality unwrapped phase distribution maps. Van et al. [22] proposed a new method based on deep learning to obtain height information from single-shot fringe patterns. A dataset of simulated fringe patterns and corresponding depth maps was designed and constructed to train the fully convolutional neural network. Experimental results proved that the network performed well in demodulating unseen modulated fringe patterns. However, both the training dataset and the unseen test data were generated by simulation and the effectiveness of this network when processing fringe patterns from real scenes has not been verified yet. Zheng et al. [23] used calibration matrices from real-world fringe projection profilometry(FPP) systems to establish virtual digital twin models and provided an effective and economical training dataset for deep learning. Real-world experiments proved that the neural network predicted 3D geometry from single-shot fringe pattern successfully. However, the effectiveness and fidelity of synthetic datasets, as well as the noise simulation of the real-world imaging systems, still need to be evaluated and studied. The above researches on directly predicting unwrapped phase or depth information from fringe patterns presents some issues of cross-stage deep learning methods, such as the need for improvement in accuracy and questionable performance in the real world. In fact, obtaining the unwrapped phase from single-shot fringe pattern directly requires the network to simultaneously extract two types of feature information: wrapped phase and fringe order. The changes of the two in the same cycle are completely different, which brings great difficulties to the learning of the network. Directly obtaining depth data from single-shot fringe patterns makes it impossible for the network to utilize important intermediate variable information and the prediction error of each stage will easily accumulate in the final result. Some studies [24,25] have held negative attitude towards end-to-end cross-stage deep learning methods and demonstrated the superiority of using deep learning in stages to obtain intermediate variables such as wrapped phase and fringe order in both theories and experiments. And numerous studies based on deep learning methods have been conducted to replace traditional methods in the single technical stage of fringe projection measurement. They improve the obtaining accuracy and efficiency of intermediate variables and even the entire 3D measurement results [26–31].

Using deep learning methods to achieve phase demodulation from single-shot modulated fringe patterns has attracted considerable attention and research from scholars. Yan et al. [32] first introduced neural networks into Fourier transform profilometry and used a three-layer backpropagation neural network to learn the continuous approximation function of discrete fringe patterns. Then the half cycle wrapped phase of the fringe patterns can be calculated. This method has higher spatial resolution and better restoration of complex details compared to Fourier transform profilometry. Feng et al. [33] demonstrated that deep learning is an effective solution for phase demodulation and verified that in terms of accuracy and object edge information preservation, deep learning methods outperformed two representative single-shot phase demodulation methods: FT and windowed Fourier transform method(WFT) in the experiments. The following year, this team proposed micro deep learning profilometry for high-speed 3D surface imaging [34]. This method used a trained neural network to extract wrapped phase from three fringe patterns with different fringe frequencies and achieved 3D reconstruction at a speed of 20000 frames per second. Compared with FT and WFT, this method reduces the error in phase demodulation to one-third. It preserved more edge and detail information of the measured object. The final 3D reconstruction accuracy is comparable to the 12-step phase shifting method. Yang et al. [35] proposed a single-shot phase extraction method based on the deep convolutional generative adversarial network. At the same time, a composite loss function and a large-scale real-world fringe image dataset are specifically designed. Compared with FT, this method exhibits better performance in processing fringe patterns with complex surfaces, abrupt edges and different projection frequencies. It can also achieve automatic denoising. Li et al. [36] proposed a phase demodulation and unwrapping method based on multi frequency composite fringe patterns. Two neural networks were used to predict the wrapped phase and coarse-precision unwrapped phase from single-shot multi frequency composite fringe patterns respectively. Then the high-precision unwrapped phase can be calculated. The performance of this method is obviously improved compared to FT and the unwrapped phase accuracy predicted from single-shot fringe patterns can be comparable to the unwrapped phase results calculated by the traditional 12-step phase shifting method and the time phase unwrapping method. Nguyen et al. [25] proposed a method that combines fringe-to-phase network and fringe projection profilometry. It projected single-shot color patterns onto the tested object, where the RGB channels contained fringe information of different frequencies. Then the information of each channel was separated and the wrapped phase was predicted respectively by the neural network. Subsequently, phase unwrapping and 3D reconstruction were carried out using heterodyne methods. In experiments the performance of this method was better than that of end-to-end networks which directly predicted depth from fringe patterns. It has little difference in accuracy but significant improvement in efficiency compared to phase-shifting profilometry.

The phase demodulation task requires the network to input modulated fringe patterns and output wrapped phase patterns of the same resolution. So it belongs to the deep learning type of equal-size image conversion. The efficient extraction and utilization of features at different scales by networks is a prerequisite for accurately predicting the overall phase distribution and local surface details of fringe patterns. In recent years, researches on phase demodulation methods based on deep learning have tended to use U-Net [37] as the main body of neural network models for equal-size image conversion. For example, the generator of generative adversarial networks in Ref. [35], as well as the network subjects in Refs. [36] and [25], have all adopted the U-Net model. This model sequentially upsamples and downsamples the input images and extracts feature information from different scales. Then it achieves equal-size feature fusion through skip connections. However, feature information at different scales can only be transmitted layer by layer, which inevitably leads to the loss of information. So it is difficult for this model to extract sufficient feature information from all scales and the accuracy of the output wrapped phase will also be limited. Therefore, this paper proposes a method for fringe projection phase demodulation based on a full-scale connection network. We design a full-scale connection network model based on the task characteristics of equal-size image conversion and integrate Squeeze and Excitation channel attention modules(Squeeze and Excitation: SE) to enhance the network's acquisition and utilization of feature information from different scales. Ultimately the network called SE-Full-Scale-Connection Net(SE-FSCNet) is formed for phase demodulation. The SE-FSCNet trained on the dataset predicts the numerator and denominator of the wrapped phase arctangent function from single-shot fringe patterns and then obtains the wrapped phase through simple calculation. It can achieve higher accuracy than the U-Net model and the traditional FT method. The rest of this paper is organized as follows. Section 2 provides a detailed description of the phase-shifting method for preparing high-precision wrapped phase labels, the proposed network architecture and a comparative analysis of two networks’ intermediate features which use different hierarchical connection methods; Section 3 presents the ablation experiment of the internal modules in the proposed network, as well as the comparative experiment conducted between the U-Net model and our model, along with the summary and discussion in Section 4.

2. Theories and methods

2.1 Multi-step phase-shifting method and dataset construction

Generally speaking, it is quite challenging to synthesize a fringe projection dataset with real noise, lighting and object surface reflectance. There is a significant difference between synthetic data and real data captured and produced in real scenes. The network trained with synthetic data may not perform well in real scene, so the real dataset should be chosen as much as possible. The phase labels of real datasets are usually generated using high-precision methods which are based on traditional physical models. Among the current methods of phase demodulation, the multi-step phase-shifting method has the highest measurement accuracy and noise resistance. Therefore, numerous studies have chosen the multi-step phase-shifting method as the generation method for dataset labels [25,33–36]. The relationship between the fringe images and the wrapped phase in the N-step phase-shifting method can be obtained by equation(1):

(1)$${\boldsymbol \varphi }(x,y) = \arctan \frac{{{\boldsymbol M}(x,y)}}{{{\boldsymbol D}(x,y)}} = \arctan \left( {\frac{{\sum\limits_{n = 0}^N {{{\boldsymbol I}_n}(x,y)\sin \left( {\frac{{2\mathrm{\pi }n}}{N}} \right)} }}{{\sum\limits_{n = 0}^N {{{\boldsymbol I}_n}(x,y)\cos \left( {\frac{{2\mathrm{\pi }n}}{N}} \right)} }}} \right), $$

where the number of phase-shifting steps $n = 0,1,\ldots ,N - 1$. $(x,y)$ represents the pixel coordinates. ${\boldsymbol I}$ represents the intensity of image pixels. ${\boldsymbol \varphi }$ represents the phase image to be measured. The numerator and denominator terms of the fractions on the right side of this equation are represented by ${\boldsymbol M}$ and ${\boldsymbol D}$ respectively. With equation(1), high-precision wrapped phase labels corresponding to N fringe images can be obtained.

Based on these considerations, we select the phase demodulation dataset constructed by Zuo et al. [12] with the three-step phase-shifting method for the training, validation and test of neural networks. This dataset contains a total of 1000 pairs of fringe images and wrapped phase images. The dataset includes 120 scenes overall, with 2-26 captured images for each scene. In each scene, the tested objects will be rotated to multiple angles and then a series of fringe images will be captured. The number of fringe images captured in each scene is related to the number and the surface complexity of objects in this scene. Scenes with high difficulty in phase demodulation usually have more fringe images. The tested objects in these 120 scenes include metal industrial parts, gypsum models, medical models and other common objects from life or industry. They will be placed separately or combined together to form the tested scene. When creating one data pair of fringe pattern and wrapped phase map, three fringe patterns with a phase shift of $\frac{{2\mathrm{\pi }}}{3}$ are sequentially projected onto the object. Then the first modulated fringe pattern captured by the camera is used as input data for the network and all three modulated fringe patterns are used to generate the labels of the network outputs—the numerator ${\boldsymbol M}$ and denominator ${\boldsymbol D}$ terms of the wrapped phase arctangent function in this scene. The resolution of all images is normalized to 640 × 480 pixels. Figure 1 shows examples of data pairs for different types of tested objects in the phase demodulation dataset.

Fig. 1. Examples of input and corresponding truth labels in the dataset.

Download Full Size | PDF

The dataset contains 1000 data pairs of input and truth labels totally. To ensure the convergence and generalization of the network, 1000 pairs of data were divided into training, validation and test sets in a ratio of 60%, 20% and 20%. The training set will participate in the training phase of the network to assist in iterative optimization of network parameters. Its truth labels can provide a basis for the network to calculate loss functions and backpropagation gradients. All the fringe patterns and wrapped phase maps in the training set are visible to the neural network. The validation set is also used for network training. Neural networks predict fringe images in the validation set but cannot optimize their own parameters using truth labels. People will observe the predicted results and adjust the hyperparameters of the neural network based on the results. The test set will be used to test the trained network. The fringe patterns and truth labels in the test set are unfamiliar to neural networks and have never been seen before during the training phase. This arrangement can ensure that the network will not exhibit false “high performance” due to overfitting phenomenon.

2.2 Design of the proposed full-scale connection network architecture

A network that only sets skip connections between the same hierarchy cannot efficiently transmit and utilize feature information from all scales. To address this issue, a full-scale connection network SE-FSCNet is proposed, whose skip connections and feature fusion modules enable each hierarchy in the network decoder to directly receive and utilize full-scale feature information. In addition, a channel attention module based on squeeze and excitation principles has been added to the network to assign different weights to each channel from different hierarchies and help the network integrate multi-scale feature information.

Figure 2 shows the phase demodulation process and internal structure of SE-FSCNet. The input of the network is a single-shot fringe pattern with a resolution of 640 × 480 pixels. The encoder of the network is similar to U-Net, consisting of five encoding layers and four pooling layers. Each encoding layer contains two identical convolutional layers to extract input feature information. The size of kernels in the convolution layer is 3 × 3 and the stride is 1. “ReLU” is selected as the activation function. The convolutional blocks at different hierarchies are connected with 2 × 2 max-pooling layers and higher dimensional features of input information are extracted by increasing channel depth and reducing spatial resolution. For ease of description, the five encoding layers arranged in forward propagation order are called E1, E2, E3, E4 and E5. The decoder of the network has four decoding layers called D1, D2, D3 and D4. Their feature scales correspond to E1∼E4. Table 1 shows the sequential structure and layer parameters of the network(Due to space constraints, the layer parameters in each feature fusion module and attention module are not detailed in the table). Each layer of the network outputs a third-order tensor and the size of the output tensor in the table is denoted as $H \times W \times C$, where H and W represent the width and height of the tensor. C represents the channel number of the tensor and also the filter number used in the convolutional layer.

Fig. 2. Phase demodulation process and internal structure of full-scale connection network SE-FSCNet. (a)The process of predicting wrapped phase using SE-FSCNet. Using a fringe pattern as input, the trained network outputs the numerator and denominator terms of the wrapped phase arctangent function and then the wrapped phase can be calculated using equation(2), (b)The internal structure diagram and legend of SE-FSCNet.

Download Full Size | PDF

Table 1. The sequential structure and layer parameters of the SE-FSCNet

View Table | View all tables in this article

Each decoding layer of the network integrates small-scale and equal-scale features from the encoding layer, as well as large-scale features from the decoding layer. These two kinds of features contain low-level semantic information and high-level semantic information respectively. In order to ensure the feasibility of full-scale feature information fusion, a feature aggregate module(FA module) is designed. It is difficult to directly concatenate feature information from different scales. So firstly the FA module downsamples the features from the small-scale encoding layer using max pooling and upsamples the features from the large-scale decoding layer using bilinear interpolation. As a result, the resolution of feature information from different hierarchies is unified. Subsequently, the FA module uses 5 convolutional layers containing the same number of filters to process feature information from different hierarchies and unify their channel numbers. Then all the feature information is concatenated and fused together in the channel dimension. Finally, convolution and activation operations are performed again. Taking the decoding layer D3 as an example, the process of integrating full-scale feature information in the FA module is shown in Fig. 3.

Fig. 3. Schematic diagram of the process of integrating full-scale feature information in the FA module.

Download Full Size | PDF

When the FA module implements feature fusion, it uses the same number of channels to concatenate feature information from 5 different hierarchies, which is equivalent to artificially assigning the same weight to different scale feature information. However, this weight allocation often does not match reality and how to allocate the weight should be learned by the network itself. Therefore, a channel attention module based on Squeeze and Excitation(SE) is added between the last FA module of the decoder and the decoding layer D1, abbreviated as the SE module [38]. When features with a dimension of $H \times W \times C$ are input into the SE module, the two-dimensional feature of each channel are compressed into a real number through global average pooling firstly. This number is equivalent to having the global receptive field of that channel. At this point the feature size is $1 \times 1 \times C$. Then the correlation between channels is constructed through two fully connected layers and a weight value is learned and generated for each feature channel. Finally, the normalized weights are assigned to the feature information of each channel using element-wise multiplication. The specific working process of the SE module is shown in Fig. 4:

Fig. 4. Schematic diagram of the process of weighting channel feature information in SE module.

Download Full Size | PDF

The network finally sends the feature information output by the SE module into two convolutional layers and generates the final output result of the network—the numerator and denominator terms of the arctangent function of the wrapped phase. The wrapped phase can be easily calculated using equation (1). Considering that there is a jump in the wrapped phase at the edge of each cycle, this sharp discontinuity poses a challenge to the learning of neural networks. Therefore, in the selection of network output, the wrapped phase will not be directly predicted from the fringe pattern.

2.3 Comparative analysis of intermediate features of SE-FSCNet and U-Net

Neural networks achieve direct mapping from input to output end-to-end. Although it has a “black box” attribute which means internal states cannot be observed, it is still possible to visualize the output of intermediate layers at different positions in the network and understand how input is transformed into feature information learned by the network.

To verify the superiority of the proposed network in full-scale feature extraction and utilization, we selected the most widely used U-Net model in phase demodulation and the SE-FSCNet model proposed in this paper. After initializing the two networks, the input fringe patterns are propagated forward and the resulting intermediate layer feature output will be used for visualization, comparison and analysis. The two networks have the same encoder. In detail, their encoding layers and downsampling method are completely consistent. Therefore, the visualization of the intermediate feature output of the encoder is performed together, as shown in Fig. 5. The convolutional layers to output features are labeled with black arrows in Fig. 5(a) and named encoder layers1∼5 based on the shallow to deep positions in the network. The features output by these layers have three dimensions: width, height and channel. Each channel contains relatively independent feature information. So we display the feature of each channel as a two-dimensional image and concatenate the images generated by all channels to form the feature information sets of encoder layers1∼5, as shown in Fig. 5(b). Due to the different depths of encoder layers1∼5 in the network, there are differences in the receptive fields of their convolutional kernels. As a result, the features of these layers also have different characteristics. Shallow convolutional layers extract fine-grained features that contain rich spatial information such as surface details, textures and contours. Deep convolutional layers extract coarse-grained features representing information of a region. They have rich semantic information but low resolution.

Fig. 5. The feature output positions of U-Net and SE-FSCNet encoder and their respective feature information sets. (a)The feature output positions of two networks’ encoder. Each output position is located after the two convolutional layers of the convolutional module. (b)The graph of the feature output sets of the encoder. The feature output from top to bottom comes from encoder layer1 to encoder layer5 respectively. The dashed box displays the enlarged details of a certain channel.

Download Full Size | PDF

The connection of different hierarchies between the network encoder and decoder determines whether the network can effectively utilize the feature information of each scale. The decoding layer of U-Net model is usually connected to the decoding layer at the previous hierarchy and the encoding layer at the same hierarchy, which is not conducive to the transmission of feature information between distant hierarchies. The range of feature scales that each decoding layer can access is relatively limited, only containing features of the same and the smaller scale. Figure 6 visualizes the output of each convolutional module at the U-Net decoder, with the output positions marked by black arrows in Fig. 6(a). They are named decoder layer4∼1 according to their position order from deep to shallow in the network.

Fig. 6. The feature output positions of the U-Net model decoder and their respective feature information sets. (a)The feature output positions of the U-Net model decoder. (b)The feature information output set of decoder layer1. The set contains a total of 64 channels and each channel contains feature information with a resolution of 640 × 480 pixels. The dashed box represents enlarged display of certain typical feature channels, where orange represents surface gradient features, green represents phase distribution features and red represents edge contour features. (c)The feature information output set of decoder layer2. The feature information size is 320 × 240 × 128. The meanings of different color dashed box is the same as (b). (d)The feature information output set of decoder layer3. The feature information size is 160 × 120 × 256. The meanings of different color dashed box is the same as (b). (e)The feature information output set of decoder layer4. The feature information size is 80 × 60 × 512. The meanings of different color dashed box is the same as (b).

Download Full Size | PDF

Figures 6(b)∼(e) visualize the feature output of each hierarchy at the decoder and zoom in on certain channels. It can be seen that the granularity, resolution and semantic richness of the feature information which is output by each hierarchy of the U-Net decoder are consistent with the feature information of the corresponding hierarchy at the encoder. The feature information within the same hierarchy generally exhibits the phenomenon of single scale. As a result, when a hierarchy at the decoder collects feature information through upsampling, the scale of all received feature information is the same, excluding smaller scale feature information from deeper hierarchies. The U-Net model still has certain shortcomings in the transmission and utilization of full-scale features.

The SE-FSCNet proposed in this paper designs a new connection method between the encoder and decoder, as shown in Fig. 2(b). Each hierarchy of the SE-FSCNet decoder is connected to five different depth hierarchies through the FA module, allowing for direct access to full-scale feature information. Taking the output of the last convolutional layer at the SE-FSCNet decoder as an example, we verify whether the full-scale connection design can help the intermediate feature output at the decoder include multiple scales, as shown in Fig. 7.

Fig. 7. The feature output position and feature information set of SE-FSCNet decoder. The output position of feature information is the last convolutional layer at the SE-FSCNet decoder. The set contains a total of 320 channels. The resolution of feature in each channel is 640 × 480 pixels. The dashed box represents enlarged display of certain typical feature channels, where orange represents surface gradient features, green represents phase distribution features and red represents edge contour features.

Download Full Size | PDF

The selected position outputs a set of feature information whose size is 480 × 640 × 320. The set is concatenated on the channel dimension and forms the feature map on the right side of Fig. 7. Due to space limitations, it is impossible to display the feature of each channel one by one, making it difficult to directly observe the diverse scales of feature information in the map. Therefore, in Fig. 7 we select some typical features, such as contour features, phase distribution features, and surface gradient features. They are highlighted with red, green and orange dashed boxes respectively. The output information of each decoding layer of U-Net are filtered and the same three types of features are selected. These three types of intermediate features in the two network models are arrange in ascending order of scale and compared in Table 2, as shown below.

Table 2. Comparison of intermediate feature outputs between U-Net and SE-FSCNet

View Table | View all tables in this article

From Table 2, It is obviously that the feature scales output by the four decoding layers of the U-Net model are limited to a very small range and there are almost no differences in granularity and semantic information between feature maps of different channels within the same hierarchy. The output of the last convolutional module of the SE-FSCNet decoder contains features with at least four different scales. The results prove that the full-scale connection method of SE-FSCNet has better abilities of feature transmission and utilization compared with the equal-scale connection method of U-Net. The advantage of this capability not only helps SE-FSCNet improve the performance, but also demonstrates its adaptability to attention modules such as the SE module. Because the lossless transmission of multi-scale feature information generated in the image encoding process is the basis for the screening and weight allocation of feature information in the image decoding process.

3. Experiments

3.1 Environment and training hyperparameter settings

We built a deep learning environment and the network architecture above on a computer and conducted relevant experiments. The experiments used a computer with an Inter Core i7-9700 CPU along with GeForce RTX 2080 Ti graphics card and were conducted on Ubuntu18.04. The experiments selected TensorFlow, an open-source deep learning platform, and Keras, a neural network programming interface developed on the base of TensorFlow, for the construction, training and prediction of the network. In addition, Nvidia CUDA9.0 and cuDNN7.0 were used to accelerate the training process.

By referring to the experience and tricks of training neural networks and comparing multiple experimental results, the input data range and hyperparameter settings of the network were determined. For the input single-shot fringe patterns, the intensity of each pixel is compressed from [0,255] to [0,1] through normalization. This helps to reduce the learning difficulty of the network and improve the convergence together with stability of the network. SE-FSCNet uses adaptive moment estimation(Adam) [39] as an optimizer for 300 epochs of iterative optimization, with an initial learning rate set to 0.001. The mini-batch gradient descent method is selected as the optimization method for SE-FSCNet and the batch size is set to 2. This method not only overcomes the limitations of computer memory, but also frequently update network parameters to accelerate convergence of the network. Mean Square Error(MSE) is selected as the loss function of the network and the calculation equation is as follows:

(2)$$L\textrm{oss}(\theta ) = \frac{1}{{L \times W}}[{{{||{{{\boldsymbol P}_M} - {{\boldsymbol G}_M}} ||}^2} + {{||{{{\boldsymbol P}_D} - {{\boldsymbol G}_D}} ||}^2}} ], $$

where $\theta$ represents the current parameters of the model. L and W represent the length and width of the pattern matrix. ${{\boldsymbol P}_M}$ and ${{\boldsymbol P}_D}$ respectively represent the numerator and denominator terms of the wrapped phase arctangent function predicted by the network. ${{\boldsymbol G}_M}$ and ${{\boldsymbol G}_D}$ are their corresponding truth labels. When the loss function on the validation set reaches a new minimum point, the current model will be saved.

3.2 Ablation experiment, feasibility test and accuracy testing

In order to verify whether the combination of the SE modul and the full-scale connection method in the proposed network can have a positive impact on the performance of the network, the first experiment in this section was designed,which is the ablation experiment of the SE module. We remove the SE module from SE-FSCNet and name the remaining network structure FSCNet. FSCNet is trained with the same hyperparameter settings and dataset partitioning method as SE-FSCNet until it converge. A scene is randomly selected from the test set as the test sample. Note that all scenes in the test set will not participate in the training stage of the network. The scene is input into FSCNet together with SE-FSCNet and the accuracy of the prediction results are compared, as shown in Fig. 8. Figure 8(a) shows the predicted wrapped phase and corresponding error distribution maps. Figure 8(b) shows enlarged error maps corresponding to some complex areas of the fringe pattern.

Fig. 8. Comparison of results and error maps between FSCNet and SE-FSCNet in predicting wrapped phase from a random scene. (a)The predicted wrapped phase and the corresponding error maps by FSCNet and SE-FSCNet. The two gray images on the left represent the predicted wrapped phase of two networks, with a range of $( - \mathrm{\pi },\mathrm{\pi }]$. The two dark-blue images on the right represent the absolute error maps between the prediction results and the true value, with a range of [0,1]. (b) Enlarged display and comparison of prediction error in some complex areas of the fringe pattern. The red and green dashed boxes respectively highlight the surface details in different areas of the fringe pattern. The absolute error of the two network prediction results in these areas are enlarged and compared on the right, with a range of [0,1].

Download Full Size | PDF

Quantitative analysis of the error between truth labels and the predicted wrapped phase of each network was conducted with three indices: Mean Squared Error(MSE), Mean Absolute Error(MAE) and Structural Similarity(SSIM). The calculation method of mean square error (MSE) amplifies the impact of large numerical errors on overall accuracy, reflecting the stability of deviations between predicted results and true values; Mean absolute error (MAE) directly reflects the absolute accuracy of the prediction results. Structural similarity (SSIM) measures the brightness, contrast and structural similarity of an image using a combination of mean, standard deviation and covariance. The results are shown in Table 3. From the data comparison, it can be seen that the prediction accuracy and similarity of SE-FSCNet are better than those of FSCNet, proving that the SE module helps the network better utilize full-scale features and promote the performance improvement of the network.

Table 3. Quantitative analysis of phase demodulation errors predicted by FSCNet and SE-FSCNet

View Table | View all tables in this article

In order to directly verify the feasibility of the proposed phase demodulation method based on the SE-FSCNet model, the second experiment designed in this section compares the most commonly used methods in the field of phase demodulation—FT and multi-step phase-shifting method—with our method in terms of accuracy in obtaining wrapped phase. In practical applications, due to the limitations of spatial resolution, traditional single-shot phase demodulation technologies have relatively poor phase demodulation accuracy when facing complex objects with a large surface gradient. Therefore, in order to synchronously verify the phase demodulation ability of our method on complex surface objects, in the second experiment a scene containing two separated plaster models was randomly selected as the test sample from the test set. The two plaster models have many texture details, large surface gradients, and a difference in distance from the camera. Figure 9(a) shows the truth label generated by the multi-step phase-shifting method of this scene. Figures 9(b) and 9(c) respectively show the spectrum and the wrapped phase map obtained by FT. The fringe pattern of this scene is input into the trained SE-FSCNet and the numerator and denominator terms of the wrapped phase arctangent function are predicted, as shown in Figs. 9(d) and 9(e) respectively. The final wrapped phase is calculated with equation (1), as shown in Fig. 9(f).

Fig. 9. Phase demodulation results of phase-shifting method, FT and our method in a random separate complex scene. (a)The wrapped phase map calculated by the multi-step phase-shifting method. (b)Fourier spectrum. (c)The wrapped phase map calculated by FT. (d)The numerator term of the wrapped phase arctangent function predicted by SE-FSCNet. (e)The denominator term of the wrapped phase arctangent function predicted by SE-FSCNet. (f)The wrapped phase predicted by SE-FSCNet.

Download Full Size | PDF

The multi-step phase-shifting method gains high measurement accuracy at the cost of measurement efficiency and has an absolute advantage in accuracy among commonly used methods of phase demodulation. Therefore, we use the wrapped phase map calculated by the three-step phase-shifting method as the truth label to evaluate the phase demodulation accuracy of FT and our method. The phase demodulation error of the two methods are shown in Fig. 10(a) and 10(b). Due to the relatively large error range of FT, the colorbar interval of the error map is set to $[{0,2\mathrm{\pi }} ]$. The phase error of each pixel in the effective area of the tested scene is counted. The pixels with different errors are distributed to different error intervals and presented in the form of a bar graph. The error distribution of the two methods is shown in Fig. 10(c) and 10(d) respectively. It can be seen that there are more pixels distributed in the higher error intervals among the wrapped phase map calculated by FT while the prediction error of our method is concentrated in the lower error intervals.

Fig. 10. The wrapped phase error maps and error distribution of FT and the our method. (a)The wrapped phase error map calculated by FT. (b)The wrapped phase error map predicted by SE-FSCNet. (c)The distribution of wrapped phase error calculated by FT. The red dashed box represents the number of pixels in the high error intervals. (d)The distribution of wrapped phase error predicted by SE-FSCNet. The red dashed box represents the number of pixels in the high error intervals.

Download Full Size | PDF

Using the same indices as Table 3, the accuracy of the results of FT and our method in a random complex scene was further quantitatively measured and compared, as shown in Table 4. From the data in Table 4, it can be seen that the phase prediction error of the proposed method is approximately one fourth of the error of FT. The former is quite close to the accuracy of the multi-step phase-shifting method.

Table 4. Quantitative analysis of phase demodulation error predicted by FT and SE-FSCNet

View Table | View all tables in this article

In the design stage of the network model in this paper, our network is referred to and compared with the widely used U-Net model in the field of phase demodulation. The full-scale connection method of SE-FSCNet is inspired by the limitations of the equal-scale skip connection method of U-Net. Therefore, the third experiment in this section compares the performance of our method and U-Net model in predicting wrapped phase. The hyperparameters of the U-Net model have been tuned carefully to achieve the best results possible. A scene was randomly selected from the test set and input into two networks. The prediction results and the data analysis are shown in Fig. 11. Figure 11(a) shows the truth label generated by the phase-shifting method and the wrapped phase maps predicted by two networks. In Fig. 11(b), a row of data in the middle of the phase map is selected for comparison and its local details are enlarged. It is found that in areas with geometric discontinuities and phase jumps on the surface of objects, the phase demodulation results of SE-FSCNet are more in line with the true values and have higher prediction accuracy.

Fig. 11. Phase demodulation results of SE-FSCNet and U-Net along with the data analysis. (a)True value labels and wrapped phase prediction results of SE-FSCNet and U-Net. The red, green and blue dashed lines in the figure indicate where the three phase data in Fig. 11(b) come from. (b)The comparison between the 240th row phase data of the truth label, the result of SE-FSCNet and the one of U-Net. The purple and yellow dashed boxes in the figure respectively correspond to the details of the three types of phase data near the 50th and 452nd column.

Download Full Size | PDF

Figures 12(a) and 12(b) show the error maps of the wrapped phase predicted by U-Net and SE-FSCNet respectively. In order to compare the error maps in more detail, Fig. 12(c) enlarges the phase error of two complex regions in the fringe pattern,where the facial features and the curly hair on the head locate in. It can be seen that when facing complex areas on the surface of the tested object, the prediction of U-Net have obvious error while the prediction error of our method is low relatively. SE-FSCNet has better accuracy and performance in phase demodulation when facing surface with geometric discontinuity and high detail complexity. The MSE, MAE and SSIM indices of the two network results compared to the true values are shown in Table 5. The results prove that SE-FSCNet performs slightly better than U-Net in all three indices, which means that in the phase demodulation task, when facing surface with geometric discontinuity and high detail complexity the performance of SE-FSCNet has comprehensively surpassed U-Net.

Fig. 12. Comparison of phase demodulation error between U-Net and SE-FSCNet. (a)The error map of the wrapped phase predicted by U-Net. (b)The error map of the wrapped phase predicted by SE-FSCNet. (c)Amplification and comparison of error in complex regions of the fringe pattern. The red and green dashed boxes respectively outline the surface details at different positions in the fringe pattern and the absolute error of the prediction results at these positions are enlarged and compared on the right, with a range of [0,1].

Download Full Size | PDF

Table 5. Quantitative analysis of phase demodulation error predicted by U-Net and SE-FSCNet

View Table | View all tables in this article

4. Summary and discussion

In this paper, we propose a single-shot phase demodulation method for fringe projection based on a full-scale connection network SE-FSCNet. A full-scale feature connection method is designed based on the characteristics of the phase demodulation task and a channel attention module based on squeeze and excitation is introduced. Compared with two classic phase demodulation methods, FT method and multi-step phase-shifting method, the method based on SE-FSCNet model requires only single-shot modulated fringe pattern as available information. At the same time, its accuracy is quite close to multi-step phase-shifting method in predicting the wrapped phase. It overcomes the problem of difficulty in unifying accuracy and efficiency when calculating wrapped phase. The experiments proved that the full-scale feature connection method proposed in this paper had a positive effect on improving the performance of the network. The adaptability of the SE module to the full-scale feature connection method is also verified. In addition, the experiment also compared SE-FSCNet with U-Net, which is a mainstream model among phase demodulation methods based on deep learning and SE-FSCNet showed superior performance in phase demodulation. In summary, the SE-FSCNet model proposed in this paper can achieve high-precision phase demodulation of single-shot fringe patterns, with better performance than traditional methods and the mainstream model. It can be applied to fringe projection measurement and imaging technologies requiring phase demodulation. Due to limitations in computational memory, this paper did not add more complex attention modules to more locations in the network. However, the better abilities of SE-FSCNet in full-scale feature transmission and utilization demonstrate its adaptability to attention modules. This means that SE-FSCNet has greater potential for application in subsequent research and development. When combined with more complex attention mechanisms, it will show better performance.

Funding

Sichuan Science and Technology Program (2023YFG0181).

Acknowledgments

The authors would like to thank Lei Huang and Mourad Idir in the Optics and Metrology group of NSLS-II for helpful language revision and advices. The authors also would like to thank the anonymous reviewers and the associate editor for their insightful comments that significantly improved the quality of this paper.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Zhang, “Recent progresses on real-time 3D shape measurement using digital fringe projection techniques,” Opt. Lasers Eng. 48(2), 149–158 (2010). [CrossRef]

2. J. Xue, Q. Zhang, C Li, et al., “3D Face Profilometry Based on Galvanometer Scanner with Infrared Fringe Projection in High Speed,” Appl. Sci. 9(7), 1458 (2019). [CrossRef]

3. J. Zhang, W. Guo, Z. Wu, et al., “Three-dimensional shape measurement based on speckle-embedded fringe patterns and wrapped phase-to-height lookup table,” Opt. Rev. 28(2), 227–238 (2021). [CrossRef]

4. T. Mitsuo, I. Hideki, and K. Seiji, “Fourier-transform method of fringe-pattern analysis for computer-based topography and interferometry,” J. Opt. Soc. Am. 72(1), 156–160 (1982). [CrossRef]

5. K. Qian, “Two-dimensional windowed Fourier transform for fringe pattern analysis: Principles, applications and implementations,” Opt. Lasers Eng. 45(2), 304–317 (2007). [CrossRef]

6. A. Dursun, S. Ozdar, and FN. Ecevit, “Continuous wavelet transform analysis of projected fringe patterns,” Meas. Sci. Technol. 15(9), 1769 (2004). [CrossRef]

7. K. Qian, “Windowed Fourier transform for fringe pattern analysis,” Appl. Opt. 43(13), 2695–2702 (2004). [CrossRef]

8. L. Liu, L. Yang, X. Chu, et al., “A novel phase unwrapping method for binocular structured light 3D reconstruction based on deep learning,” Optik 279, 170727 (2023). [CrossRef]

9. V. Srinivasan, H. Liu, and M. Halioua, “Automated phase-measuring profilometry of 3D diffuse objects,” Appl. Opt. 23(18), 3105–3108 (1984). [CrossRef]

10. C. Zuo, S. Feng, and L. Huang, “Phase shifting algorithms for fringe projection profilometry: A review,” Opt. Lasers Eng. 109, 23–59 (2018). [CrossRef]

11. B. Chen and S. Zhang, “High-quality 3D shape measurement using saturated fringe patterns,” Opt. Lasers Eng. 87, 83–89 (2016). [CrossRef]

12. C. Zuo, K. Qian, S. Feng, et al., “Deep learning in optical metrology: a review,” Light: Sci. Appl. 11(1), 39–54 (2022). [CrossRef]

13. K. Wang, L. Song, C. Wang, et al., “On the use of deep learning for phase recovery,” Light: Sci. Appl. 13(1), 4 (2024). [CrossRef]

14. K. Wang, K. Qian, J. Di, et al., “Deep learning spatial phase unwrapping: a comparative review,” APN. 1(01), 014001 (2022). [CrossRef]

15. K. Yan, Y. Yu, C. Huang, et al., “Fringe pattern denoising based on deep learning,” Opt. Commun. 437, 148–152 (2019). [CrossRef]

16. B. Lin, S. Fu, C. Zhang, et al., “Optical fringe patterns filtering based on multi-stage convolution neural network,” Opt. Lasers Eng. 126, 105853 (2020). [CrossRef]

17. E. Aguenounon, J. Smith, M. Al-Taher, et al., “Real-time, wide-field and high-quality single snapshot imaging of optical properties with profile correction using deep learning,” Biomed. Opt. Express 11(10), 5701–5716 (2020). [CrossRef]

18. J. Tan, W. Su, Z. He, et al., “Deep learning-based method for non-uniform motion-induced error reduction in dynamic microscopic 3D shape measurement,” Opt. Express 30(14), 24245–24260 (2022). [CrossRef]

19. H. Nguyen, N. Dunne, and H. Li, “Real-time 3D shape measurement using 3LCD projection and deep machine learning,” Appl. Opt. 58(26), 7100–7109 (2019). [CrossRef]

20. J. Qian, S. Feng, Y. Li, et al., “Single-shot absolute 3D shape measurement with deep-learning-based color fringe projection profilometry,” Opt. Lett. 45(7), 1842–1845 (2020). [CrossRef]

21. Y. Li, J. Qian, S. Feng, et al., “Deep-learning-enabled dual-frequency composite fringe projection profilometry for single-shot absolute 3D shape measurement,” Opto-Electron. Adv. 5(5), 210021 (2022). [CrossRef]

22. S. Van and J. J. J. Dirckx, “Deep neural networks for single shot structured light profilometry,” Opt. Express 27(12), 17091–17101 (2019). [CrossRef]

23. Y. Zheng, S. Wang, Q. Li, et al., “Fringe projection profilometry by conducting deep learning from its digital twin,” Opt. Express 28(24), 36568–36583 (2020). [CrossRef]

24. W. Li, J. Yu, S. Gai, et al., “Absolute phase retrieval for a single-shot fringe projection profilometry based on deep learning,” Opt. Eng. 60(06), 064104 (2021). [CrossRef]

25. H. Nguyen, E. Novak, and Z. Wang, “Accurate 3D reconstruction via fringe-to-phase network,” Measurement 190, 110663 (2022). [CrossRef]

26. P. Yao, S. Gai, Y. Chen, et al., “A multi-code 3D measurement technique based on deep learning,” Opt. Lasers Eng. 143, 106623 (2021). [CrossRef]

27. G. E. Spoorthi, S. Gorthi, and R. K. S. S. Gorthi, “PhaseNet: A Deep Convolutional Neural Network for Two-Dimensional Phase Unwrapping,” IEEE Signal Process. Lett. 26(1), 54–58 (2019). [CrossRef]

28. J Zhao, L. Liu, T. Wang, et al., “VDE-Net: a two-stage deep learning method for phase unwrapping,” Opt. Express 30(22), 39794–39815 (2022). [CrossRef]

29. J. Zhang and Q. Li, “EESANet: edge-enhanced self-attention network for two-dimensional phase unwrapping,” Opt. Express 30(7), 10470–10490 (2022). [CrossRef]

30. W. Yin, Q. Chen, S. Feng, et al., “Temporal phase unwrapping using deep learning,” Sci. Rep. 9(1), 20175 (2019). [CrossRef]

31. K. Wang, Y. Li, K. Qian, et al., “One-step robust deep learning phase unwrapping,” Opt. Express 27(10), 15100–15115 (2019). [CrossRef]

32. T. Yan, W. Chen, X. Su, et al., “Neural network applied to reconstruction of complex objects based on fringe projection,” Opt. Commun. 278(2), 274–278 (2007). [CrossRef]

33. S. Feng, Q. Chen, G. Gu, et al., “Fringe pattern analysis using deep learning,” Adv. Photon. 1(02), 1 (2019). [CrossRef]

34. S. Feng, C. Zuo, W. Yin, et al., “Micro deep learning profilometry for high-speed 3D surface imaging,” Opt. Lasers Eng. 121, 416–427 (2019). [CrossRef]

35. T. Yang, Z. Zhang, H. Li, et al., “Single-shot phase extraction for fringe projection profilometry using deep convolutional generative adversarial network,” Meas. Sci. Technol. 32(1), 015007 (2020). [CrossRef]

36. Y. Li, J. Qian, S. Feng, et al., “Single-shot spatial frequency multiplex fringe pattern for phase unwrapping using deep learning,” in Proceedings of SPIE, Optics Frontier Online 2020:Optics Imaging and Display(SPIE, 2020) pp. 11571.

37. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Proceedings of 18th International Conference on Medical Image Computing and Computer-assisted Intervention(MICCAI, 2015), pp. 234–241.

38. J. Hu, L. Shen, and G. Sun, “Squeeze-and-Excitation Networks,” in Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR, 2018), pp. 7132–7141.

39. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of 3rd International Conference on Learning Representations(ICLR, 2015), pp. 149801.

Method	FSCNet	SE-FSCNet
MSE(rad)	0.0993	0.0935
MAE(rad)	0.0368	0.0347
SSIM	0.9577	0.9608

Method	FT	SE-FSCNet
MSE(rad)	0.5590	0.1386
MAE(rad)	0.2105	0.0534
SSIM	0.8125	0.9405

Method	U-Net	SE-FSCNet
MSE(rad)	0.1392	0.1236
MAE(rad)	0.0516	0.0475
SSIM	0.9450	0.9478

Method	FSCNet	SE-FSCNet
MSE(rad)	0.0993	0.0935
MAE(rad)	0.0368	0.0347
SSIM	0.9577	0.9608

Method	FT	SE-FSCNet
MSE(rad)	0.5590	0.1386
MAE(rad)	0.2105	0.0534
SSIM	0.8125	0.9405

SE-FSCNet: full-scale connection network for single-shot phase demodulation

Abstract

1. Introduction

2. Theories and methods

2.1 Multi-step phase-shifting method and dataset construction

2.2 Design of the proposed full-scale connection network architecture

2.3 Comparative analysis of intermediate features of SE-FSCNet and U-Net

3. Experiments

3.1 Environment and training hyperparameter settings

3.2 Ablation experiment, feasibility test and accuracy testing

4. Summary and discussion

Funding

Acknowledgments

Disclosures

Data availability

References

Data availability

Cited By

Figures (12)

Tables (5)

Equations (2)

Optics Express

Layer		Filters	Kernel size	Stride	Activation	Output size
	input	-	-	-	-	480 × 640 × 1
E1	conv	64	3 × 3	1	ReLU	480 × 640 × 64
-	conv	64	3 × 3	1	ReLU	480 × 640 × 64
-	max pool	-	2 × 2	2	-	240 × 320 × 64
E2	conv	128	3 × 3	1	ReLU	240 × 320 × 128
-	conv	128	3 × 3	1	ReLU	240 × 320 × 128
-	max pool	-	2 × 2	2	-	120 × 160 × 128
E3	conv	256	3 × 3	1	ReLU	120 × 160 × 256
-	conv	256	3 × 3	1	ReLU	120 × 160 × 256
-	max pool	-	2 × 2	2	-	60 × 80 × 256
E4	conv	512	3 × 3	1	ReLU	60 × 80 × 512
-	conv	512	3 × 3	1	ReLU	60 × 80 × 512
-	max pool	-	2 × 2	2	-	30 × 40 × 512
E5	conv	1024	3 × 3	1	ReLU	30 × 40 × 1024
-	conv	1024	3 × 3	1	ReLU	30 × 40 × 1024
-	dropout	-	-	-	-	30 × 40 × 1024
FA Module	…	-	-	-	-	-
D4	conv	320	3 × 3	1	ReLU	60 × 80 × 320
FA Module	…	-	-	-	-	-
D3	conv	320	3 × 3	1	ReLU	120 × 160 × 320
FA Module	…	-	-	-	-	-
-	conv	320	3 × 3	1	ReLU	240 × 320 × 320
FA Module	…	-	-	-	-	-
SE Module	…	-	-	-	-	-
D1	conv	320	3 × 3	1	ReLU	480 × 640 × 320
-	conv	320	3 × 3	1	ReLU	480 × 640 × 320
Network output	conv	2	3 × 3	1	-	480 × 640 × 2