Accurate 3D reconstruction of single-frame speckle-encoded textureless surfaces based on densely connected stereo matching network

Ruike Wang; Pei Zhou; Pei Zhou; Jiangping Zhu; Jiangping Zhu

doi:10.1364/OE.486031

1. Introduction

The stereo vision imitates the principle of human vision by finding matching points between perspective views and restores 3D space according to the principle of triangulation. In many cases the stereo matching is one of critical factors that influences the accuracy of 3D reconstruction, while the general stereo matching task has poor matching performance in textureless regions. As a dominant method of structured light projection based methods, SPP has been extensively applied in the binocular stereo vision for 3D measurement, which enriches target texture information by additionally projecting one or more than one speckle pattern(s). Obviously, it is easier and more feasible to overcome the problem of low stereo matching accuracy and achieve better 3D reconstruction, especially dynamic 3D imaging with single-frame speckle pattern projection. Evidently, the advantages of simple configuration and high measurement accuracy for such an active 3D measurement technique naturally make it extremely popular in various application scenarios [1–13].

Sum of absolute differences (SAD), normalized cross correlation (NCC), zero-normalized cross correlation (ZNCC) and census are common cross correlation based stereo matching algorithms [14,15]. For these traditional algorithms, the matching cost is calculated by performing block matching to obtain integer pixel disparity map, and then subpixel disparity map is obtained by interpolation. SAD is mostly employed for preliminary screening for its high speed but low precision. Besides, ZNCC is more robust than NCC because it subtracts the mean value of grayscale within the matched sub-window, and is more resistant to changes in environmental light. The operation of only comparing the gray value of the center with the surrounding pixels in census degrades the matching accuracy. Yet SGM [16] is a combination of the advantages of global algorithm and local algorithm which employs the one-dimensional path aggregation strategy to replace the two-dimensional minimization in the global algorithm, leading to great improvement in efficiency.

However, when these methods are directly introduced in SPP, the matching accuracy of these methods can not be markedly improved unless multi-frame speckle patterns are projected and more constraints are utilized at cost of measurement efficiency [17]. Generally, it is still extremely challenging to achieve satisfactory matching result via single frame speckle projection alone despite its good application prospect in dynamic 3D imaging. Therefore, there is still a lack of robust stereo matching solution for accurate 3D measurement through single frame speckle projection.

Recently, most of the matching works [18–21] are implemented by DL, showing better results than traditional methods. They have a similar procedure to traditional methods, including feature extraction, cost volume construction, cost aggregation and disparity regression. Similarly, feature extraction in neural networks is equivalent to cost calculation in traditional algorithms. Jure et al. [18] implemented the matching cost calculation via convolutional neural network (CNN) and then utilized the traditional algorithm for other steps. Especially, Chang et al. [19] improved the matching accuracy via the spatial pyramid pooling in feature extraction module and different levels of feature maps were concatenated as the final feature maps. Xu et al. [20] carried out rich operations to establish the cost volume, but only a few layers of residual modules in the feature extraction module are simply performed. Even if the cat operation is performed on multiple layers at the end, the feature information failed to be fully extracted, which will affect the network performance. Yin et al. [21] adopted the feature extraction module similar to Ref [19]., performed multi-scale pooling operation, and spliced the obtained features of different sizes. However, features being not sufficiently downsampled in the network lead to large size features and prejudice to enlarge receptive field, which degraded the feature extraction ability.

As for the procedure of 3D cost volume construction, Luo et al. [22] and Mayer et al [23]. calculated the feature correlation of stereo images to construct the volume. There are also some works [24] that directly concatenated the features to construct a 4D cost volume. However, the above correlation and concatenation based methods have their own shortcomings. The former is more efficient but at cost of losing useful information, while the latter gets more complete information but requires a cumbersome number of parameters. Guo et al. [25] proposed to combine the two categories of methods. Although this operation increases network parameters, it achieves an improvement in accuracy. For example, Wang et al. [26] adopted an iterative patch match instead of the widely used traditional constructing volume method to predict disparity through continuous convolution. While for the feature extraction module, they still implemented multi-scale feature extractor to extract feature information for remaining steps, which demonstrated that the feasibility of multi-scale feature extraction.

Aiming to solve the problem of unsatisfactory matching result of single frame speckle encoding texture-less targets for existing algorithms and extend its application in high accuracy dynamic 3D imaging, we propose a stereo matching network called Densely Connected Stereo Matching Network (DCSM) for single frame speckle pattern encoding 3D target in this work. The DCSM network architecture consists of four modules, including a densely connected feature extraction module, a cost volume construction module, a 3D CNN module and a disparity regression module. Our early investigation has concluded that the ability of the feature extraction has a great influence on the information transmitting effect of the network. Consequently, we deploy the densely connected construction in DCSM network by drawing on the PSM’s [19] multi-scale approach in the feature extraction module to extract richer features. Each layer in the module accepts all shallow features and these features are concatenated, then passed to the next layer to ensure the effective transmission of features. And the attention weight parameters are added into the cost volume construction to maximize the utilization of feature information.

For the sake of verifying our idea, we establish a real 3D measurement system and its digital twin to construct training and test datasets. The overall flow chart is shown in the Fig. 1, which can be outlined the following three main steps:

Fig. 1. The overall flow chart of this work.

Download Full Size | PDF

Step1 A real measurement system and its digital twin with Blender are established. In the real 3D measurement system, the system calibration is conducted to obtain intrinsic and extrinsic parameters and then they are imported into the digital twin 3D system. The digital 3D simulation system makes it easier for us to construct rich training datasets.

Step2 We construct the SPP dataset under both systems. We project speckle patterns [27,28] on the targets, and the captured stereo images are grouped into training data and test data. Projecting three frames of speckle is only to increase the richness of the data, while only a single frame speckle is as the network input. At the same time, FPP [12,13] is introduced to project phase-shifting fringe patterns to the target to be tested. The phase information is extracted by three frequency temporal phase unwrapping (TPU) algorithm [12,29] and the disparity is obtained by phase matching as the GT of training and testing data.

Step3 Stereo speckle images are input into the DCSM network to train and test the proposed stereo matching network. We conduct multiple sets of comparative tests on various models to prove the generalization of the proposed DCSM network. The 3D geometry can be reconstructed with the output disparity maps of the network and the known system parameters.

2. Principles

2.1 Speckle projection profilometry (SPP)

The passive binocular stereo vision through stereo matching under the ambient light illumination to obtain depth information heavily depends on the feature information of the target surface itself. However, the weak texture on the surface of object inevitably leads to difficulty in the stereo matching task, and gives a low reconstruction accuracy. Fortunately, speckle structured pattern with rich anisotropic high-frequency characteristics that has been extensively adopted to enrich the surface texture of object in order to obtain more accurate disparity. Our group [30] proposed a novel 3D surface profile measurement scheme via only a single-shot color binary speckle pattern (CBSP) and a temporal-spatial correlation matching algorithm, which can be applied to the measurement of dynamic and static objects. In this paper, we introduce the same generation strategy of numerical random speckle pattern in Ref [28].

2.2 Fringe projection profilometry (FPP)

While choosing the distorted speckle stereo image pairs as the training data of the network, we need to construct reasonable and reliable GT data. FPP [31] has been considered as one of the most reliable and accurate 3D measurement technique. This technique projects a sequence of sinusoidal fringe patterns onto the object surface, and retrieves the phase information from the deformed fringe images collected by the cameras. Finally, a pair of continuous phase maps can accurately determine the disparity map of the tested object, which can be treated as the GT of the SPP dataset. The phase shifting offset between every two adjacent fringe patterns is 2π/N, and then the distorted fringe images captured by the cameras are expressed as:

(1)$${I_i}({x,y} )= A({x,y} )+ B({x,y} )\cos \left( {\varphi ({x,y} )- \frac{{2\pi i}}{N}} \right)$$

where (x, y) are the pixel coordinate, ${I_i}({x,y} )$ represents the intensity of captured images, A(x,y) represents the average intensity, B(x,y) represents the modulation, φ(x, y) represents the phase to be acquired, and i represents the phase shifting (i = 1,2,…,N). The wrapped phase φ(x, y) can be described as:

(2)$$\varphi ({x,y} )= \arctan \frac{{\sum\limits_{i = 1}^N {{I_i}({x,y} )\sin \frac{{2\pi i}}{N}} }}{{\sum\limits_{i = 1}^N {{I_i}({x,y} )\cos \frac{{2\pi i}}{N}} }}$$

The phase value φ(x, y) is wrapped in a range of -π to π and cannot correctly reflect the phase information of the measured object. Here, multi-frequency TPU algorithm [12,29] is adopted to unwrap φ(x, y) to the continuous phase $\phi ({x,y} )$.

(3)$$\phi ({x,y} )= \varphi ({x,y} )+ 2\pi k({x,y} )$$

where k represents the fringe order determined by the TPU algorithm.

2.3 Establishment of 3D measurement system

Considering that the research objective of this paper is to solve the problem of insufficient accuracy of single frame speckle matching of traditional algorithms and DL-based algorithms, it is necessary to use speckle data as network data support. While existing public datasets such as KITTI and Sceneflow are not speckle images which are not suitable for testing our ideas, so we need to reconstruct the speckle data set. The existing public datasets such as KITTI and Sceneflow are not suitable for verifying our idea which make us construct speckle datasets. We firstly establish a real structured light illumination 3D measurement system, as shown in Fig. 2, including a pair of binocular industrial cameras and a projector. The projector is applied to project three speckle patterns and fringe patterns of low, medium and high frequencies on the target to be measured. In this paper, fringe patterns of three frequencies (1, 8, 64) are projected on the tested surface, whose phase shifting steps N are set to 4, 4, and 10, respectively. The pair of cameras simultaneously capture the corresponding distorted speckle and fringe images under the same scenario from left and right perspectives.

Fig. 2. The diagram of real 3D measurement system for data construction.

Download Full Size | PDF

As we all know, reasonable and correct data has played a vital role in the data-driven DL methods. However, it is practically difficult and unpractical to construct such a rich 3D dataset. Fortunately, the advanced computer technology makes it possible to construct such simulation datasets [32–34].

Blender is a free and open source 3D graphics and image software that offers a series of functions from modeling, animation, material and rendering, etc. Therefore, it is believable that if the 3D measurement system of the real scene is established via Blender software, it will be easier to obtain rich 3D data. To make the data simulated by Blender closer to the real 3D system, we plan to set the parameters in Blender that are exactly the same as that of the real 3D measurement system. The intrinsic and extrinsic parameters involved from the real system shown in Table 1 is adopted to establish the digital 3D system [33,35].

Table 1. Calibration parameters of the real 3D system

View Table | View all tables in this article

After we placed a pair of virtual cameras and a virtual projector in Blender, and the parameters representing the virtual cameras transformed from Table 1.

f_x and f_y are the focal length in X-axis and Y-axis, u₀ and v₀ represent the principal point coordinates, k₁∼k₃ represent the coefficients of radial distortion, p₁ and p₂ represent the coefficients of tangential distortion, R and T denote the rotation and the translation matrixes, respectively. Firstly, T represents the translation of the left camera coordinate system to the right camera coordinate system, therefore we calculate the baseline by taking the sum of the squares and the square root of T. And then we convert R into the rotation angles of the XYZ axis for virtual cameras by Rodrigues’s formula. We set the above parameters for the pair of virtual cameras.

As shown in Fig. 3(a), a virtual projector is located between two virtual cameras, and a 3D model is placed from the working distance(800-1100 mm) of the system. In Fig. 3(b),we set the virtual cameras’ focal length, pixel size, resolution and other parameters according to the intrinsic parameters. The virtual projector can ‘project’ speckle pattern and a series of fringe patterns as described in Sec. 2.1 and Sec.2.2 onto the 3D model, and the virtual cameras synchronously ‘capture’ the patterns ‘distorted’ by the ‘tested’ 3D model.

Fig. 3. (a) The digital 3D system in Blender (Version 2.93.2); (b) simulation parameters transformed from Table 1.

Download Full Size | PDF

In the meanwhile, Blender's projector parameters are set according to the parameters of the real projector. In the real system, the projector is located between the binocular cameras with the projection resolution 912 × 1140, and the focus position is adjusted at the position of the projected object. In the Blender digital twin system, we restore the 3D measurement scene following the Settings under the real system. Firstly, as shown in Fig. 4(a), projector light source is placed between the binocular cameras, and the mapping position of its projection image is adjusted by setting the x and y of the mapping position to 0.5 m respectively, so that the speckle pattern and fringe pattern cover the tested target model area. Then, the projection mode is set to the same scale as the real projector, and the scaling parameters y and z are set to 1.114 and 0.912, respectively, as shown in Fig. 4 (a). In addition, in order to ensure that the speckle pattern and fringe pattern projected onto the target model position are in focusing state, we set the radius of the light as 0 m, as shown in Fig. 4(b). The demonstration of speckle pattern and fringe pattern projected onto the target model is shown in Fig. 4(c).

Fig. 4. Projector setup in Blender. (a) The shading node tree of condtruction a projector; (b) projector parameters; (c) projection effects in Blender.

Download Full Size | PDF

2.4 Construction of the SPP dataset

In the real system, Constellations, Chinese zodiac masks of five poses (front, left, right, head down and head up) are set as the training objects, and plaster models in different degrees are set as the tested objects, as shown in Fig. 5. These poses are used to imitate the unfixed angles of the object being measured in real applications, which can enhance the generalization ability of the network. Meanwhile, three frames of randomly generated speckle patterns [30] and 18 fringe patterns are projected onto the tested target and the left and right cameras are synchronized to capture the scene. The purpose of setting three frames of speckle patterns is only to enrich the data, while only one pair of single frame speckle image is required as input each time during network training. The projection of three different speckle patterns is an extension of data volume. The three pairs of speckle images captured synchronously during the projection of the three-frame speckle pattern have different speckle distributions but with the same generation manner, which can be considered as three pairs of different data. After processing, they are all used as training sets. Compared with the projection of the single-frame speckle pattern during the construction of data, the data size changes to three times.

Fig. 5. Partial model types and poses. From left to right, the images in (a) and (b) are respectively offset frontward, by 30 degrees leftward, 30 degrees rightward, 20 degrees downward, and 20 degrees upward. And the images in (c) are respectively offset frontward, by 30 degrees leftward, 40 degrees leftward, 30 degrees rightward, and 40 degrees rightward.

Download Full Size | PDF

The background of speckle images captured in the real system is beforehand removed by calculating modulation of fringe images. Although the black curtain is set as the background in the shooting process to reduce the interference of scene information as much as possible, it still cannot avoid the existence of some invalid information outside the area of the measured target. Therefore, it is necessary to remove the background of the collected speckle images to prevent the interference of the results. The fringe images captured synchronously with the speckle images have the same scene information, so the background can be removed by calculating the fringe modulation. The modulation can be used to evaluate the quality of fringe images. The quality of fringe images is good and the modulation value is high in the valid region, whereas in the invalid region such as the background, the quality of fringe images is poor and the modulation value is low. Therefore, by setting a threshold, the poor quality region below the threshold is set to 0, so as to generate a mask to remove the background according to the valid region. After testing during the experiment, the threshold was set to 0.05. A total of 18 frames of fringe patterns were projected in the experiment, and the calculation process of modulation m was shown as:

(4)$$m[i ]= \frac{{2\ast \sqrt {{a^2} + {b^2}} }}{{18}}$$

(5)$$a +{=} \sin \frac{{2\pi \ast n}}{{18}}\ast n[i ]$$

(6)$$b +{=} \cos \frac{{2\pi \ast n}}{{18}}\ast n[i ]$$

where i represents the pixel point of the image, $m[i ]$ is the modulation of pixel i, n is the number of fringe images and with the range in [1,18].

Then the epipolar-rectified speckle images (resolution of 1280 × 1024) are cropped into smaller sub-images (resolution of 1280 × 256) as shown in Fig. 1. These processed sub-images are served as training data. The accurate disparity map of the corresponding speckle images is obtained by the method mentioned in Sec.2.2, which is cropped in the same way as the speckle images and employed as GT.

In the digital 3D simulation system, we also change each 3D model with multiple poses like the data settings in the real 3D system. The actual images will suffer noise interference due to factors caused by the environment and the cameras. To ensure that the simulated images are as close to the real situation as possible, we firstly add Gaussian noise with mean 0 and variance 0.005 to the ‘captured’ simulated speckle images. Then we perform the same background removing and epipolar rectification as the real data. As for the acquired simulation fringe images we also follow the implementation of the real fringe images.

Through the above procedures, we gain approximately 1725 pairs speckle images and their corresponding disparity. These speckles images are divided into 1660 pairs and 65 pairs as training data and testing data,respectively. The training data contains 1060 pairs real speckle images and 600 pairs simulation speckle images, while the testing data are all real speckle images.

3. Densely connected stereo matching network

Here, a stereo matching network called DCSM for single-frame speckle pattern encoding 3D target is constructed on the basis of the classical stereo matching network PSM [19]. In this paper, we draw on the classic idea of PSM: (1) The multi-scale feature extraction and stacked hourglass structure of PSM has obvious effects on combination of global and local information and prevention of the loss of detailed information; (2) Additionally, the idea of our proposed dense connection is adopted in the feature extraction module to cascade the shallow features to the deep features to prevent the loss of features and obtain more details; And (3) the attention weight parameter is added into the cost volume construction part to maximize the function of feature information. Figure 6 illustrates the network structure diagram of the proposed DCSM network.

Fig. 6. The network structure diagram of our proposed DCSM network.

Download Full Size | PDF

3.1 Network architecture

First, let's describe how to design the network structure according to the dataset characteristics and task objectives. The data objects of the speckle dataset constructed in this paper are human faces as well as mask and plaster models with facial features. Different from the data in large scenes, the data set in this paper has more details and there is a parallax jump within the target. In addition, task objectives also require clear boundaries of the results of the algorithm. Based on the above elements, the neural network's sufficient extraction of features and the retention of detailed information is particularly important. Stereo matching networks consist of four modules, including feature extraction module, a volume construction module, a 3D CNN module and a disparity regression module. According to the study of related work, the feature extraction module largely determines the result accuracy, and it is also the best position to extract detailed information in the shallow level. We construct a DCSM network combined with SPP, which has stronger accuracy and generalization than other networks through dense connections between subsequent layers. We integrate the idea of dense connection into the feature extraction module, and construct several dense blocks, and each layer of features in dense block is directly connected with all the features in the shallow layer. It can prevent the loss of features caused by the convolution operation, and retain more effective information. In addition, multi-scale feature extraction is adopted to integrate the features extracted at multiple scales to better retain the detailed information of different scales.

When a pair of speckle images is input into the network, accurate feature map is extracted via the densely connected feature extraction module and is concatenated into a cost volume through attention weight, then is fed into 3D CNN for regularization. Finally, regression is implemented to obtain the final output disparity map. The following paragraphs provide a detailed description of each module.

A. Densely connected feature extraction module

Feature extractor reduces the dimensionality of original data and extracts important information for subsequent procedures, which largely determines the matching accuracy of the stereo matching network. In this paper, the densely connected structure [36] is introduced for feature extraction based on the observation that deep networks can maximize the transfer of feature information to the deep layers. We construct the three-level dense blocks in the feature extraction module. First, a pair of speckle images of size H × W first go through two 2D convolutions of 3 × 3 kernel with strides of 2 and 1, and the downsampled feature maps of 1/4H × 1/4W × 32 can be obtained. Then, three dense blocks [36] are followed to produce features at 1/16, 1/64 and 1/256 resolution. Finally, the outputs of the three dense blocks are upsampled to 1/16 resolution, and are concatenated them to form 1/4H × 1/4W × 320 feature map.

The features in each dense block have the same size and different number of channels, and dense connections are adopted in the block. Each group of dense layers in the block consists of a 1 × 1 convolution and a 3 × 3 convolution. The first 1 × 1 convolution layer can reduce data dimension computational consumption before the 3 × 3 convolution calculation, which can also improves the expressiveness of the network combined with the activation functions. The three dense layers in each block receive the feature information extracted by each previous layer, and concatenate them as their own input, to realize the effective transmission of feature information.

Considering the increases of network calculation complexity as the depth of feature layers, a 1 × 1 convolution and a pool are utilized to reduce the channels immediately and downsampled the resolution to 1/4 of the block input. In this way, next dense block can receive more lightweight feature information.

B. Cost volume construction

Since the concatenation operation does not contain feature similarity information, so more parameters are required to learn similarity in the following aggregation network. Guo et al [25]. combined concatenation and correlation based methods to construct cost volume. Following the idea of ACV [20] in the meanwhile, we generate attention weights from correlation methods to suppress redundant information and enhance matching-related information for concatenation volume. We first concatenate the left and right images to obtain a cost volume of dimension 1/4W × 1/4H × 32, then perform the correlation operation on the left and right features to construct a corresponding cost volume of 1/4H × 1/4W × 32. Further we process the correlation volume from coarse to fine to obtain weight information. After three 3D convolution groups, the number of channels is reduced from 32 to 16, and an hourglass module integrated with the attention mechanism is introduced [25], followed by a convolution layer to compress the number of channels to 1 to obtain the weight matrix w. This process gradually narrows the parallax range, reducing volume build memory and computing parameters. Finally, the concatenation volume and the weight matrix w are filtered to eliminate redundant information in the initial cost volume and enhance the expressive ability. The final volume at each channel is computed as,

(7)$$Volume = w \odot Volum{e_{concat}}$$

where ⊙ represents the element-wise product, and the $Volum{e_{concat}}$ represents the initial concatenation volume.

C. 3D CNN

The 3D CNN will further optimize the 4D volume through two sets of 3D convolutions as 1/4H × 1/4W × 1/4D × 32 before three hourglass modules are employed to obtain three outputs. It is an encoder-decoder framework composed of four convolutional layers and two deconvolutional layers. Undergoing upsampling and disparity regression, each hourglass module is processed into a H × W tensor, which is three output features. The second and third outputs are added into the previous hourglass’s output, integrating richer feature information without adding too many skips, as shown in Fig. 6.

D. Disparity regression

Our continuous disparity map is estimated using disparity regression as described in Ref [24], which can offer better performance than classification-based stereo matching algorithms. Based on the predicted cost volume, the normalized probability pd of each disparity level d can be calculated using the Softmax operation [22]. Then, the predicted disparity is yielded by summing each disparity d and weighting them by the probability pd as follows:

(8)$${D_p} = \sum\limits_{d = 0}^{{d_{\max }}} {d \times {p_d}} $$

where d represents disparity level, ${p_d}$ represents the normalized probability at disparity level d.

3.2 Loss function

For our network, the loss function of the network is established as follows:

(9)$$L = \sum\limits_{i = 1}^3 {{w_i} \times Smooth{L_1}({{D_{gt}} - {D_{p\textrm{i}}}} )} $$

where, ${D_{gt}}$ is the GT of disparity, and SmoothL1 is widely adopted in the regression task of object detection due to its robustness and low sensitivity to outliers. D_pi are the three output features mentioned in Sec.3.1, w_i are their weights 0.5, 0.7 and 1,respectively. The calculation formula of SmoothL1 is given by,

(10)$$Smooth{L_1} = \left\{ {\begin{array}{c} {0.5{x^2},if|x |< 1}\\ {|x |- 0.5,otherwise} \end{array}} \right.$$

4. Experimental results and discussion

As shown in Fig. 7, we establish an experimental 3D system, including a pair of Daheng industrial cameras (Model MER-131-210U3M, resolution of 1280 × 1024 pixels, focal length 16 mm), a DLP light field projection module (Model Light Crafter 4500, resolution of 912 × 1140 pixels), and a computer. We set the baseline distance to about 400 mm and place the target to be tested at a distance of 800-1100 mm from the system. The network architecture is implemented on the platfrom of the PyTorch framework and deploy our network to run on NVIDIA RTX 3090 GPUs. The number of training sets used in the training process is approximately 1660 pairs, the batchsize is set to 4, the network parameter maxdisp is set to 256, the initial learning rate is set to 0.001, and the network can converge when the epoch is 150. In this paper, End-Point-Error (EPE) and N-Pixel-Error are selected as disparity map evaluation indicators. Mean distance (Mean_dis.), Standard deviation (Std_dev.) and Root Mean Square Error (RMSE) are selected as 3D evaluation indicators.

Fig. 7. The experimental 3D system.

Download Full Size | PDF

To demonstrate the advantages of our solution, we compare it with PSM [19], ACV [20], Yin's Net [21], AMSN [37], SGM [16] and ZNCC [15]. The same parameters with the same maxdisp and loss function are deployed for these network-based methods. We conducted multiple sets of comparative tests on the models which didn’t participate in the training to prove the generalization of the proposed DCSM network.

4.1 Accuracy evaluation

First, we select a new mask model that doesn’t participate in training for testing. Figure 8 shows the test results of different algorithms in terms of disparity map. The first and second rows are the GT and the predicted disparity map of each algorithm, respectively. Several network-based methods, including Ours, PSM [19], Yin's Net [21] and AMSN [37], are capable of predicting the disparity more completely, while ACV [20] has some error areas on the upper edge of the mask. The two traditional algorithms, SGM [16] and ZNCC [15], fail to obtain a relatively complete disparity map accompanied by many incorrectly matched pixels in the mask. The third and fourth rows are 0.5-Pixel-Error maps corresponding to each algorithm. The white areas are all pixels with an error greater than 0.5, which more intuitively shows the performance of each algorithm. The error maps have a certain degree of error in the parallax jump area, such as the edge of the mask. It is obviously found that there is the least white area across our model, indicating that our method performs the best with the smallest error.

Fig. 8. Disparity maps for different methods and their corresponding 0.5-Pixel-Error distributions.

Download Full Size | PDF

Table 2 lists the N-Pixel-Error and EPE, which are respectively defined as follows:

(11)$$N - Pixel - Error = \frac{1}{n}\sum\limits_{i = 1}^n {\left[ {({|{pr{e_i} - g{t_i}} |> N} )\& \left( {\frac{{|{pr{e_i} - g{t_i}} |}}{{g{t_i}}} > p} \right)} \right]} $$

(12)$$EPE = \frac{1}{n}\sum\limits_{i = 1}^n {|{pr{e_i} - g{t_i}} |} $$

where n represents the total number of pixels, $pr{e_i}$ is the predicted disparity, and $g{t_i}$ is the GT, p is the parameter for N-Pixel-Error calculation, when N=0.5, p=0.01, when N=1, p=0.02.

Table 2. Error analysis of different methods.

View Table | View all tables in this article

The 0.5-Pixel-Error and 1-Pixel-Error of our method are 6.41% and 1.52%, respectively. Compared to PSM [19], 0.5-Pixel-Error for our method reduced by 11.2% while it is reduced by 44.3% in comparison with other network-based methods such as ACV [20]. Overall, the network-based methods show marked advantages over the traditional correlation-based methods, nevertheless our method is the optimal in terms of three evaluation metrics.

Next, we reconstruct the 3D results(point cloud) of all the methods. The first and second rows in Fig. 9 are the GT of point cloud and the reconstructed 3D. The corresponding reconstruction errors of all algorithms, are respectively shown in the third and fourth rows.

Fig. 9. Point cloud models constructed by different methods and their point cloud errors.The first and second lines represent the point cloud and the third and fourth lines represent the point cloud errors.

Download Full Size | PDF

Similar to the disparity maps, ACV [20] presents obvious reconstruction error areas at the edges, and SGM [16] and ZNCC [15] fail to completely reconstruct the mask. The scope and distribution of errors of the corresponding error maps obviously shows that both of ours and PSM errors are smaller. A few layers of residual modules in the feature extraction module was simply performed in ACV [20], which can’t extract feature information sufficiently and affected the network performance. Yin's Net [21] adopted the feature extraction module similar to PSM [19], but a lightweight 3D convolutional network is constructed when processing 4D volumes, therefore the performance of Yin's Net [21] is slightly worse than that of PSM [19]. The quantitative evaluation results of these methods are shown in Table 3 in terms of Mean_dis., Std_dev. and RMSE. Three indicators are respectively defined as follows:

(13)$${Mean\_dis}. = \frac{{\sum\limits_{i = 1}^n {({es{t_i}({u,v} )- g{t_i}({u,v} )} )} }}{n}$$

(14)$$avg = \frac{{\sum\limits_{i = 1}^n {g{t_i}({u,v} )} }}{n}$$

(15)$${Std\_dev}. = \sqrt {\frac{{\sum\limits_{i = 1}^n {{{({es{t_i}({u,v} )- avg} )}^2}} }}{n}} $$

(16)$$RMSE = \sqrt {\frac{{\sum\limits_{i = 1}^n {{{({es{t_i}({u,v} )- g{t_i}({u,v} )} )}^2}} }}{n}} $$

where n represents the total number of pixels, $es{t_i}({u,v} )$ is the predicted 3D cloud point, and $g{t_i}({u,v} )$ is the GT of 3D cloud point.

Table 3. Error statistics of point clouds constructed by different methods(mm)

View Table | View all tables in this article

In order to verify the test results more clearly and quantitatively, we partially visualized the point cloud of the 500^th column, as shown in Fig. 10. All of compared algorithms can predict the disparity with the same trend approximating the GT, but there exist some distinguishable outliers in traditional methods (SGM [16] and ZNCC [15]) on the point cloud. We calculate RMSE and Peak to Valley(PV) as shown in Fig. 10, the error of our method fluctuates in a relatively smaller range with RMSE 0.14196 mm and PV 1.0162 mm. Our proposed DCSM network distinctly be provided with minimum RMSE up to 17.9% smaller than other network-based methods. In spite of 0.0315 mm(3%) difference with AMSN in terms of PV, our method only has several larger error values on the jump areas such as the eye edge and the face edge of the tested model. It can be found that the heights of the jump areas have a sudden change compared to the surrounding height, the error fluctuation range is concentrated as a whole, which can be proved by referring to Fig. 11.

Fig. 10. Point cloud reconstruction of different methods.

Download Full Size | PDF

Fig. 11. Point cloud error relative with GT in the 500th column of different methods.

Download Full Size | PDF

Finally, we evaluate the results of our method, PSM [19], ACV [20], Yin's Net [21], and AMSN [37] algorithms on testing data containing 65 pairs real speckle stereo images. The testing images contain three different types of models including Constellations masks, Chinese zodiac masks and plaster models as shown in Fig. 4. At the same time they have the same five poses mentioned in Sec. 2.4 and white textured surface features. The 0.5-Pixel-Error of these algorithms is recorded in Table 4, from which we can make a conclusion that our network has absolute advantage in terms of accuracy, and the 0.5-Pixel-Error is reduced by 14.3%∼33.4% compared with other network-based methods.

Table 4. 0.5-Pixel-Error of different algorithms on 65 pairs of real speckle stereo images

View Table | View all tables in this article

4.2 Generalization tests

In Sec. 4.1, we present the testing results for similar models as those involved in the training. To further verify the generalization of our method in form of 3D reconstruction, as shown in Fig. 12, we choose a model with a different 3D geometrical type from the training set, i.e., the plaster model for testing. It is worth noting that no matter in Sec. 4.1 or Sec. 4.2 experiments, all of tested models are not involved in the training.

Fig. 12. Comparison of different methods for multi-angle reconstruction. The images in (a), (b), (c) and (d) are respectively offset frontward, by 30° leftward, 30° rightward, and 40° rightward.

Download Full Size | PDF

To make the experiment more convincing, Fig. 12 shows the 3D reconstruction results and corresponding errors in different angles of network-based methods. Figure 12(b) is offset 40 degrees leftward relative to Fig. 12(a), and Fig. 12(c) and Fig. 12(d) are offset rightward by 30 and 40 degrees relative to Fig. 12(a), respectively. The quantitative evaluation results of these methods in terms of Mean_dis., Std_dev. and RMSE are given in Table 5. Angles a, b, c and d correspond to Fig. 12(a)-(d), respectively. From the tested results in Table 5, as a whole we can conclude that our method perform better in comparison with its competitors in terms of all evaluation indicators. Mean_dis. of our method is no more than 0.255 mm and has a maximum reduction of 18.3% compared with other network-based methods. Our method’s Std_dev. is at least about 0.211 mm at Angle b and is reduced by up to approximately 50% at Angle a and c. Similarly, the RMSE is reduced by approximately 26.4%. Despite the capacity of other network-based methods of reconstructing the 3D geometry, our method is verified to enjoy better generalization and more accurate 3D reconstruction result.

Table 5. Comparison of different methods for analysis of multi-angle reconstruction (mm).^a

View Table | View all tables in this article

4.3 Dynamic tests

Focusing on the stereo matching algorithm based on single-frame speckle-encoded, the proposed DCSM algorithm solves the problem that the traditional algorithm requires multi-frame speckle to achieve the ideal effect, and it also improves the accuracy compared with other DL-based algorithms. The implementation of single-frame speckle stereo matching enables 3D reconstruction in dynamic scenes.

An experiment with dynamic 3D reconstruction is performed in this section. Speckle images are collected matched with the motion process. After these data go through the data processing procedure described in section 2.4, the data are tested on Ours, PSM [19], ACV [20], Yin's Net [21], and AMSN [37] algorithms. Figure 13 represents the speckle images and the 3D point cloud of three selected poses reconstructed by the corresponding algorithm during the motion process. Among them, (a), (b), and (c) are the three different poses arranged in the order of capture during the movement. At the same time, we provide the video reconstructed by our algorithm under dynamic scenes (see Visualization 1).

Fig. 13. Dynamic experimental results. (a), (b) and (c) are the three selected poses in the capture process, respectively (see Visualization 1).

Download Full Size | PDF

According to Fig. 13, it is obvious that our proposed DCSM can completely and correctly reconstruct the target shape in the dynamic process, with clear boundaries and no singular values. However, PSM [19], Yin's Net [21] and AMSN [37] have more singular values and void areas on part of the models, which cannot reconstruct the target completely. Among them, PSM shows impressive reconstruction result in the initial stage of the object motion, that is, under the posture (a). As the motion amplitude gradually increases, a small hole begins to appear in the forehead area under the posture (b). And the range and number of voids under the posture (c) keep increasing. However, PSM [19] works slightly better compared with Yin's Net [21] and AMSN [37]. Yin's Net [21] begins to appear small areas of holes under posture (a), the result gradually deteriorunder under posture (b), and the cavity under the posture (c) already accounts for a third of the face around. The reconstruction effect of AMSN [37] under the three poses is not ideal, and the wrongly reconstructed regions have completely affected the discrimination of target. The reconstruction results of ACV [20] algorithm are different from the above reconstruction results of PSM [19], Yin's Net [21] and AMSN [37] algorithms. It can better recover the 3D morphology in the face area, but the boundary is extremely unclear. The foreground area merges with the background and has some singular points in the boundary part, which cannot successfully separate the effective area of the face from its reconstructed 3D point cloud model.

The experimental results demonstrate the reliability of the DCSM algorithm proposed in this paper, and the better performance against other algorithms. Most importantly, it also shows that our algorithm can effectively adapt to the dynamic scenario.

5. Conclusion

A novel 3D reconstruction method using Densely Connected Stereo Matching Network is proposed with outstanding generalization and high 3D measurement accuracy based on single-frame speckle data. In this paper, we introduce SPP and FPP to ensure data support and construct the real FPP and virtual FPP systems. Different from other network-based methods, we take advantage of the capability of feature extraction on accuracy to construct a stronger feature extraction network using a pair of speckle stereo images as network input. After extracting features by the densely connected feature extraction module, a cost volume is constructed in a weight-assisted way and then enters the 3D convolution aggregation module. Abundant experiments have verified that our network shows overwhelming superiority on its counterparts and traditional algorithms. In this case, our method also presents stronger generalization and higher accuracy undergoing the multi-type and multi-angle 3D reconstruction experiments, which indicates a good application prospect in dynamic 3D imaging via singe-frame speckle pattern projection. Furthermore, in addition to the generalization we have already verified for the data with different perspectives, we also find that it’s possible to achieve the validity of the data under different lighting effects and texture feature in the future work.

Funding

National Natural Science Foundation of China (62101364); Key Research and Development Program of Sichuan Province (2021YFG0195, 2022YFG0053); The central government guides local funds for science and technology development (2022ZYD0111); China Postdoctoral Science Foundation (2021M692260).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. X. Su and Q. Zhang, “Dynamic 3-D shape measurement method: A review,” Opt. Lasers Eng. 48(2), 191–204 (2010). [CrossRef]

2. J. Geng, “Structured-light 3D surface imaging: a tutorial,” Adv. Opt. Photonics 3(2), 128–160 (2011). [CrossRef]

3. S. Zhang, “High-speed 3D shape measurement with structured light methods: A review,” Opt. Lasers Eng. 106, 119–131 (2018). [CrossRef]

4. Z. Ma and S. Liu, “A review of 3D reconstruction techniques in civil engineering and their applications,” Adv. Eng. Inform. 37, 163–174 (2018). [CrossRef]

5. Z. Sun, Y. Jin, M. Duan, X. Fan, C. Zhu, and J. Zheng, “3-D Measurement Method for Multireflectivity Scenes Based on Nonlinear Fringe Projection Intensity Adjustment,” IEEE Trans. Instrum. Meas. 70, 1–14 (2021). [CrossRef]

6. Y. Guo, Z. Duan, Z. Zhang, H. Jing, S. An, and Z. You, “Fast and accurate 3D face reconstruction based on facial geometry constraints and fringe projection without phase unwrapping,” Opt. Lasers Eng. 159, 107216 (2022). [CrossRef]

7. H. Nguyen, T. Tran, Y. Wang, and Z. Wang, “Three-dimensional Shape Reconstruction from Single-shot Speckle Image Using Deep Convolutional Neural Networks,” Opt. Lasers Eng. 143, 106639 (2021). [CrossRef]

8. H. Du, X. Chen, J. Xi, C. Yu, and B. Zhao, “Development and Verification of a Novel Robot-Integrated Fringe Projection 3D Scanning System for Large-Scale Metrology,” Sensors 17(12), 2886 (2017). [CrossRef]

9. H. Wu, S. Yu, and X. Yu, “3D Measurement of Human Chest and Abdomen Surface Based on 3D Fourier Transform and Time Phase Unwrapping,” Sensors 20(4), 1091 (2020). [CrossRef]

10. J. Wang, Y. Zhou, and Y. Yang, “A novel and fast three-dimensional measurement technology for the objects surface with non-uniform reflection,” Results Phys. 16, 102878 (2020). [CrossRef]

11. S. Zhang, “Absolute phase retrieval methods for digital fringe projection profilometry: A review,” Opt. Lasers Eng. 107, 28–37 (2018). [CrossRef]

12. W. Yin, S. Feng, T. Tao, L. Huang, M. Trusiak, Q. Chen, and C. Zuo, “High-speed 3D shape measurement using the optimized composite fringe patterns and stereo-assisted structured light system,” Opt. Express 27(3), 2411–2431 (2019). [CrossRef]

13. Y. Li, J. Qian, S. Feng, Q. Chen, and C. Zuo, “Composite fringe projection deep learning profilometry for single-shot absolute 3D shape measurement,” Opt. Express 30(3), 3424–3442 (2022). [CrossRef]

14. Y. Chen, L. Yang, and Z. Wang, “Literature survey on stereo vision matching algorithms,” J. Graph. 41(5), 702–708 (2020).

15. B. Pan, H. Xie, and Z. Wang, “Equivalence of digital image correlation criteria for pattern matching,” Appl. Opt. 49(28), 5501–5509 (2010). [CrossRef]

16. H. Hirschmüller, “Stereo processing by semiglobal matching and mutual information,” IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 328–341 (2008). [CrossRef]

17. B. Harendt, M. Große, M. Schaffer, and R. Kowarschik, “3D shape measurement of static and moving objects with adaptive spatiotemporal correlation,” Appl. Opt. 53(31), 7507–7515 (2014). [CrossRef]

18. J. Zbontar and Y. LeCun, “Computing the Stereo Matching Cost with a Convolutional Neural Network,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1592–1599 (2015).

19. J. R. Chang and Y. S. Chen, “Pyramid Stereo Matching Network,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition5410–5418 (2018).

20. G. Xu, J. Cheng, P. Guo, and X. Yang, “Attention Concatenation Volume for Accurate and Efficient Stereo Matching,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)12971–12980 (2022).

21. W. Yin, Y. Hu, S. Feng, L. Huang, Q. Kemao, Q. Chen, and C. Zuo, “Single-shot 3D shape measurement using an end-to-end stereo matching network for speckle projection profilometry,” Opt. Express 29(9), 13388–13407 (2021). [CrossRef]

22. W. Luo, A. G. Schwing, and R. Urtasun, “Efficient Deep Learning for Stereo Matching,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)5695–5703 (2016).

23. N. Mayer, E. Ilg, H. Philip, D. Cremers, A. Dosovitskiy, and T. Brox, “A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR)4040–4048 (2016).

24. A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-End Learning of Geometry and Context for Deep Stereo Regression,” 2017 IEEE International Conference on Computer Vision (ICCV)66–75 (2017).

25. X. Guo, K. Yang, W. Yang, X. Wang, and H. Li, “Group-Wise Correlation Stereo Network,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)3268–3277 (2019).

26. F. Wang, S. Galliani, C. Vogel, P. Speciale, and M. Pollefeys, “PatchmatchNet: Learned Multi-View Patchmatch Stereo,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)14189–14198 (2021).

27. P. Zhou, J. P. Zhu, and Z. S. You, “3-D face registration solution with speckle encoding based spatial-temporal logical correlation algorithm,” Opt. Express 27(15), 21004–21019 (2019). [CrossRef]

28. K. Fu, Y. Xie, H. Jing, and J. Zhu, “Fast spatial–temporal stereo matching for 3D face reconstruction under speckle pattern projection,” Image Vis. Comput. 85, 36–45 (2019). [CrossRef]

29. C. Zuo, L. Huang, M. Zhang, Q. Chen, and A. Asundi, “Temporal phase unwrapping algorithms for fringe projection profilometry: A comparative review,” Opt. Lasers Eng. 85, 84–103 (2016). [CrossRef]

30. P. Zhou, J. Zhu, and H. Jing, “Optical 3-D surface reconstruction with color binary speckle pattern encoding,” Opt. Express 26(3), 3452–3465 (2018). [CrossRef]

31. L. Zhang, Q. Chen, C. Zuo, and S. Feng, “Real-time high dynamic range 3D measurement using fringe projection,” Opt. Express 28(17), 24363–24378 (2020). [CrossRef]

32. Y. Zheng, S. Wang, Q. Li, and B. Li, “Fringe projection profilometry by conducting deep learning from its digital twin,” Opt. Express 28(24), 36568–36583 (2020). [CrossRef]

33. F. Wang, C. Wang, and Q. Guan, “Single-shot fringe projection profilometry based on deep learning and computer graphics,” Opt. Express 29(6), 8024–8040 (2021). [CrossRef]

34. The Blender Foundation, Blender.org. https://www.blender.org/.

35. D. P. Rohe and E. M. C. Jones, “Generation of Synthetic Digital Image Correlation Images Using the Open-Source Blender Software,” Experimental Techniques, 1–17 (2021).

36. G. Huang, Z. Liu, L. V. D. Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)2261–2269 (2017).

37. Y. Huang, J. Zhu, and S. Yang, “Stereo matching algorithm based on attention mechanism,” Comput. Appl. Softw. 39(7), 245–309 (2022).

Intrinsic Parameters	f_x	f_y	u₀	v₀	k₁	k₂	k₃	p₁	p₂
Value	3370	3372	643	503	0.0007	0.0656	11.2204	-0.0022	-0.0041
Extrinsic Parameters	R: $[\begin{array}{ccc} 0 .9398 & 0 .0189 & 0 .3411 \\ - 0 .0179 & 0 .9998 & - 0 .0062 \\ - 0 .3412 & - 0 .0003 & 0 .9400 \end{array}]$				T: $[\begin{matrix} - 405.0343 \\ 2.4533 \\ 94.8984 \end{matrix}]$		Resolution:1280 × 1024

Method	0.5-Pixel-Error(%)	1-Pixel-Error(%)	EPE(px)
Ours(DCSM)	6.41	1.52	1.2997
PSM[19]	7.22	1.88	1.3310
ACV[20]	11.50	2.01	2.0299
Yin’s Net[21]	7.67	2.61	1.7297
AMSN[37]	10.24	2.47	1.3537
SGM[16]	28.73	10.79	1.7360
ZNCC[15]	30.62	15.14	1.7456

Method	Mean_dis.	Std_dev.	RMSE
Ours(DCSM)	0.191741	0.152788	0.245171
PSM[19]	0.199083	0.189935	0.275153
ACV[20]	0.232358	0.220529	0.320348
Yin’s Net[21]	0.223951	0.257567	0.341314
AMSN[37]	0.223416	0.213154	0.308787
SGM[16]	0.297341	0.227684	0.374502
ZNCC[15]	0.333671	0.483199	0.587223

Method	0.5-Pixel-Error(%)
Ours(DCSM)	4.81
PSM[19]	5.61
ACV[20]	6.58
Yin’s Net[21]	6.63
AMSN[37]	7.22

Poses	Method	Ours(DCSM)	PSM [19]	ACV [20]	Yin’s Net [21]	AMSN [37]
Angle a	Mean_dis.	0.254241	0.267136	0.286057	0.277827	0.288237
	Std_dev.	0.254948	0.443251	0.284305	0.518365	0.395356
	RMSE	0.360051	0.517526	0.403308	0.588124	0.489297
Angle b	Mean_dis.	0.253947	0.254997	0.295917	0.266550	0.294515
	Std_dev.	0.211153	0.217897	0.234766	0.223852	0.236836
	RMSE	0.330265	0.335414	0.377732	0.348078	0.377929
Angle c	Mean_dis.	0.246727	0.247760	0.285989	0.268706	0.282775
	Std_dev.	0.218833	0.250935	0.283937	0.428864	0.277381
	RMSE	0.329791	0.352638	0.403001	0.506091	0.396109
Angle d	Mean_dis.	0.242783	0.250708	0.279181	0.255912	0.297050
	Std_dev.	0.222015	0.244599	0.258439	0.254188	0.317691
	RMSE	0.328990	0.350261	0.380438	0.360697	0.434933

Accurate 3D reconstruction of single-frame speckle-encoded textureless surfaces based on densely connected stereo matching network

Abstract

1. Introduction

2. Principles

2.1 Speckle projection profilometry (SPP)

2.2 Fringe projection profilometry (FPP)

2.3 Establishment of 3D measurement system

2.4 Construction of the SPP dataset

3. Densely connected stereo matching network

3.1 Network architecture

3.2 Loss function

4. Experimental results and discussion

4.1 Accuracy evaluation

4.2 Generalization tests

4.3 Dynamic tests

5. Conclusion

Funding

Disclosures

Data availability

References

Supplementary Material (1)

Data availability

Cited By

Figures (13)

Tables (5)

Equations (16)

Optics Express