EGOF-Net: epipolar guided optical flow network for unrectified stereo matching


Abstract

It is challenging to realize stereo matching in dynamic stereo vision systems. We present an epipolar guided optical flow network (EGOF-Net) for unrectified stereo matching: it estimates robust epipolar geometry with a deep cross-checking-based fundamental matrix estimation method (DCCM) and then suppresses false matches with a 4D epipolar modulator (4D-EM) module. On synthetic and real-scene datasets, our network outperforms the state-of-the-art methods by a substantial margin. We also test the network in an existing dynamic stereo system and successfully reconstruct 3D point clouds. The technique can simplify the stereo vision pipeline by removing the rectification step. Moreover, it suggests a new opportunity for combining heuristic algorithms with neural networks. The code is available at https://github.com/psyrocloud/EGOF-Net.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Finding correspondences is a significant problem in stereo vision [1–3]. In recent years, one-dimensional matching networks operating on stereo-rectified images have received much attention. However, this is not enough for a dynamic stereo system designed to measure large-scale infrastructure at high resolution [4]. In such systems, the left and right cameras' fields of view are relatively small compared to the measuring range. The cameras have to rotate independently many times to complete a scan, which makes stereo rectification that relies on camera calibration impractical. Therefore, the more challenging problem of unrectified stereo matching over a two-dimensional search range is essential.

There are two types of approaches for unrectified stereo matching. The first is a two-step framework that combines an uncalibrated stereo rectification algorithm [5] with a standard stereo matching network [6]. This framework has to deploy two different networks, and the rectification transformation relocates all the pixels, causing a loss of image information through foreshortening effects. The second is a one-step method based on an optical flow network, which tracks the two-dimensional motion of pixels between adjacent frames. For general purposes, the direction of the pixel displacement is unconstrained, which often leads to mismatches in texture-less areas [7]. Epipolar geometry constraints can effectively refine optical flow estimation in traditional energy-minimization-based algorithms [8]. However, with unknown epipolar geometry, adding such a constraint directly to the loss function is ineffective in neural networks. Alternatively, Zhong et al. [9] propose two weaker restrictions, a low-rank constraint and a union-of-subspaces constraint, for unsupervised training and achieve competitive results. In structure-from-motion (SfM) networks [10,11], the optical flow network is a sub-module without any constraints, and its output serves only as an intermediate feature for estimating the camera positions. How to improve unrectified stereo matching with epipolar geometry inside a neural network remains an open problem.

We propose an epipolar guided optical flow network (EGOF-Net) for the unrectified stereo matching problem; it introduces epipolar constraints into an optical flow network without any loss of image information. The network uses a recurrent neural network (RNN) architecture [12] as its backbone, and the iterative nature of the RNN makes a full-image search range possible for unrectified stereo matching. The network contains two novel modules: a 4-dimensional epipolar modulator (4D-EM) and a deep cross-checking-based fundamental matrix estimation method (DCCM). The 4D-EM module has no trainable parameters and extends robust point modulation [13,14] to epipolar line modulation, suppressing false matches away from the epipolar line. In stereo vision, searching for correspondences along the epipolar line and estimating the epipolar line itself form a "chicken-and-egg" problem: once the estimated epipolar line points in the wrong direction, the matching result worsens. To address this, we compare a variety of robust fundamental matrix estimation methods [5,10,15,16] and propose the DCCM. Exploiting the randomness of the network output in occluded areas, the module uses the cross-check technique to filter out points that are inconsistent between the left-to-right and right-to-left optical flows, which effectively improves the accuracy of the estimated fundamental matrix. Our network outperforms the other methods on both synthetic and real-scene datasets and can be applied in an existing dynamic stereo vision system.

To summarize, the main contributions of this paper are as follows:

  • (1) A novel network architecture, EGOF-Net, for unrectified stereo matching. The iterative nature of the network achieves a full-image search range. By directly utilizing the epipolar constraint in cost volume modulation, the network effectively filters out false matches that are inconsistent with the epipolar geometry. Experiments on various datasets show that this network significantly improves optical flow estimation.
  • (2) A 4D-EM module that directly restricts the search range to the epipolar lines inside the network. This module requires no training. It suppresses cost volume values away from the epipolar lines and provides a cleaner cost volume for the subsequent network-based cost aggregation.
  • (3) A DCCM that robustly estimates the fundamental matrix from the stereo images. It applies the cross-check technique to remove unreliable pixel correspondences, making full use of the randomness of the optical flow outputs in occluded areas. The experiments show that the DCCM outperforms classical and recently proposed methods.

2. Related works

2.1 Unrectified stereo matching based on optical flow

Stereo matching with unrectified images has to search in a 2D space. The correspondence output is called optical flow, which gives the horizontal and vertical displacement of each pixel. Optical flow estimation methods fall into two categories: traditional heuristic algorithms and neural networks.

One of the best heuristic methods is the TV-L1 method [17], which treats optical flow estimation as an energy minimization problem. Under the total variation framework, an L1-norm data term and a regularization term handle motion discontinuities and outliers appropriately. When only camera motion is present, adjacent frames satisfy the epipolar geometric constraints. To take advantage of this fact and estimate flow along the epipolar lines, Yamaguchi et al. [18] propose a slanted-plane MRF model and effectively enhance the accuracy. Mohamed et al. [8] derive a formulation of the epipolar constraint for differential optical flow and add it to the objective function.

With the evolution of computational power, network-based methods have conquered the most popular benchmarks, such as KITTI [19] and Sintel [20]. The network architectures fall into three styles: encoder-decoder [21–23], cascaded [24,25], and recurrent [12,26]. The encoder-decoder and cascaded styles have no iterative operation, so forward propagation is efficient; however, due to the limited size of the receptive field, these networks do not work well for pixels with large displacements. A recurrent network can realize a full-image-range search under acceptable memory consumption by iteratively updating the current optical flow, which handles large displacements.

General-purpose optical flow networks place no restriction on the search direction, leading to failures in texture-less areas. In structure-from-motion (SfM) networks, the optical flow network is only a sub-module for estimating camera positions and intermediate output features for depth estimation. Yao et al. [27] align deep image features from different views to a common view via differentiable homographic warping and achieve good depth estimation results. Im et al. [28] construct a deep plane-sweep cost volume by warping deep features and regress the cost volume to a dense depth map. Wang et al. [10] propose a normalized pose estimation module and realize a scale-independent plane sweep. These works focus on pixel depth rather than optical flow, and the warping operation relocates pixels in the reference views. Besides architectural improvements, Zhong et al. [9] suggest a low-rank constraint and a union-of-subspaces constraint for unsupervised training and achieve competitive results. In summary, how to improve optical flow with epipolar geometry in a direct way, without the information loss caused by warping, remains unexplored.

2.2 Epipolar geometry estimation

Fundamental matrix estimation is the prerequisite for applying the epipolar constraint. The normalized 8-point algorithm [29] is the most popular method. However, outliers are inevitable in practice, so robust estimation models such as random sample consensus (RANSAC) and least median of squares (LMedS) are essential. These operations are non-differentiable and do not suit an end-to-end network. There are two types of approaches for achieving an end-to-end style. One replaces the robust 8-point algorithm with an end-to-end sub-network: Ummenhofer et al. [11] use a fully connected sub-module after the feature encoder to output the camera's pose matrix, and Poursaeed et al. [30] apply a Siamese network for fundamental matrix estimation without relying on point correspondences. The other makes the iterative robust sampling procedure differentiable: Brachmann et al. [31] propose the DSAC network, which replaces the deterministic hypothesis selection with a learnable probabilistic selection, and Ranftl et al. [32] cast the sampling procedure as a series of weighted homogeneous least-squares problems and estimate the weights with a deep neural network. On the contrary, when an end-to-end fashion is not required, combining the correspondence output of a network with a classical robust estimation model may generalize better. Wang et al. [10] use SIFT keypoint locations to generate a mask for sampling correspondences from the network's optical flow prediction and utilize the RANSAC model to estimate the essential matrix.

3. EGOF-Net

3.1 Network architecture

The network has three weight-sharing epipolar guided optical flow core (EGOF-Core) modules and one DCCM module, as shown in Fig. 1. A pair of unrectified stereo images, IL and IR, is the input. The pipeline has two stages. In the first stage, IL and IR are fed into inputs 1 and 2 of the first two EGOF-Cores at the bottom of Fig. 1, in the orders (IL, IR) and (IR, IL), respectively. The outputs are the unconstrained left-to-right and right-to-left flows, VLR and VRL. The DCCM module then estimates the fundamental matrix F from VLR and VRL. In the second stage, the third EGOF-Core, the upper one in Fig. 1, receives (IL, IR) and F and predicts the epipolar guided optical flow V. The epipolar guided optical flow V and the fundamental matrix F are the final outputs of the EGOF-Net.

Fig. 1. The architecture of the EGOF-Net.

In the EGOF-Core, V is a tensor that holds the 2D displacement vector of each pixel from input 1 to input 2, i.e., from I1 to I2:

$${\boldsymbol V}(u,v) = [u^{\prime} - u,v^{\prime} - v] = [\Delta u,\Delta v], $$
where (u, v) is an arbitrary pixel coordinate in I1, (u′, v′) is the corresponding pixel in I2, and [Δu, Δv] is the vector of pixel displacement.

The fundamental matrix F, a 3×3 matrix of rank 2, describes the epipolar geometry between IL and IR. The epipolar constraint is

$${(x^{\prime})^T}{\boldsymbol F}x = {(x^{\prime})^T}l^{\prime} = {l^T}x = 0, $$
where x and x′ are column vectors of homogeneous coordinates in IL and IR, [u, v, 1]T and [u′, v′, 1]T, and l = FTx′ and l′ = Fx are column vectors representing the epipolar lines in the left and right images. With V and F alone, we can perform a 3D reconstruction in affine space; adding the intrinsic parameters of the cameras makes a metric reconstruction available [33].
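
To make the constraint concrete, here is a minimal Python sketch of Eq. (2), assuming a known fundamental matrix F; the helper names are illustrative and are not part of the EGOF-Net code.

import numpy as np

def epipolar_line_right(F, x_left):
    # l' = F x: the epipolar line [a', b', c'] in the right image on which the
    # match of x_left = [u, v, 1] must lie, i.e., a'u' + b'v' + c' = 0.
    return F @ x_left

def epipolar_residual(F, x_left, x_right):
    # x'^T F x, which should be approximately zero for a correct correspondence.
    return float(x_right @ F @ x_left)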

3.2 EGOF-Core

The EGOF-Core has two image inputs, I1 and I2, and an optional fundamental matrix input, F. The shapes of I1 and I2 are h×w, and the output is the optical flow from I1 to I2. There are three trainable modules and one non-trainable module. The trainable modules are two weight-sharing Feature Encoders (FE), one Context Encoder (CE), and one Recurrent Update Module (RUM). The non-trainable module is the 4D-EM.

In the forward propagation, the two FE modules extract deep similarity features from inputs 1 and 2. The features are then correlated to form a 4D cost volume of shape h/8×w/8×h/8×w/8. To visualize this 4D cost, we collapse the first two dimensions into one and show it as an hw/64×h/8×w/8 3D tensor in Fig. 1. Each 1×h/8×w/8 slice of the tensor represents the matching cost of one pixel in input 1 against all pixels in input 2. The higher the cost value, the more likely the correct match lies there. The 4D-EM further modulates this 4D cost and passes it to the RUM. At the same time, I1 is fed into the CE module to produce context features that guide the update of the optical flow. The context features are split into two equally shaped tensors: one is called the Static_Guide, and the other is h(t) at t = 0. The Modulated 4D Cost, Static_Guide, h(t), and a zero-initialized Flow go into the RUM. After N iterations of the RUM, the network outputs the final optical flow estimation. The updating process can be written as:

$${\boldsymbol{RUM}}({\boldsymbol{Static\_Guide}},\ {\boldsymbol{Cost}},\ {\boldsymbol h}(t),\ {\boldsymbol{Flow}}^{(t)}) = [{\boldsymbol h}(t + 1),\ {\boldsymbol{Flow}}^{(t + 1)}], $$
where Cost is the Modulated 4D Cost. From Eq. (3), one can see that the RUM is essentially a gated recurrent unit (GRU) network. It learns how to update the Flow and its hidden state h(t) from the current features, which is the core logic of the whole optical flow network.
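
The following is a minimal PyTorch sketch of the two ideas described above: the all-pairs correlation that builds the 4D cost volume from the FE features, and the GRU-style iterative update of Eq. (3). It is not the authors' implementation: gru_cell and flow_head are hypothetical placeholder modules, the cost lookup uses a single nearest-neighbor sample instead of the pyramid sampling of the actual RUM (Table 2), and the 4D-EM modulation of Section 3.3 would be applied to the cost volume before the loop.

import torch

def build_cost_volume(feat1, feat2):
    # All-pairs correlation of the encoder features.
    # feat1, feat2: (B, C, H/8, W/8) -> cost: (B, H/8, W/8, H/8, W/8).
    B, C, H, W = feat1.shape
    f1 = feat1.view(B, C, H * W)
    f2 = feat2.view(B, C, H * W)
    cost = torch.einsum('bci,bcj->bij', f1, f2) / C ** 0.5
    return cost.view(B, H, W, H, W)

def lookup_cost(cost, flow):
    # Gather the cost at the current flow target (nearest neighbor only).
    B, H, W, _, _ = cost.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=cost.device),
                            torch.arange(W, device=cost.device), indexing='ij')
    tx = (xs[None] + flow[:, 0].round().long()).clamp(0, W - 1)
    ty = (ys[None] + flow[:, 1].round().long()).clamp(0, H - 1)
    b = torch.arange(B, device=cost.device)[:, None, None]
    return cost[b, ys[None], xs[None], ty, tx].unsqueeze(1)

def iterative_update(cost, static_guide, h, gru_cell, flow_head, n_iters=12):
    # RAFT-style loop: the GRU cell updates the hidden state and a small head
    # predicts a residual that is added to the current flow estimate.
    B, H, W, _, _ = cost.shape
    flow = torch.zeros(B, 2, H, W, device=cost.device)
    outputs = []
    for _ in range(n_iters):
        x = torch.cat([static_guide, lookup_cost(cost, flow), flow], dim=1)
        h = gru_cell(h, x)
        flow = flow + flow_head(h)
        outputs.append(flow)
    return outputs, h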

The FE and CE modules are detailed in Table 1. They have similar architectures except for the normalization layers. Each module starts with a basic convolution layer, and the central part contains 3×2 residual layers [34]. After the main part, another basic convolution layer adapts the feature output. For the normalization layers, the FE applies instance normalization and the CE applies batch normalization.

Table 1. The structure of the FE and CE modules.

The architecture of the iterative RUM is given in Table 2. One of the inputs, the Modulated 4D Cost, is pyramid-sampled by the Corr_Pyr_Sp layer at four levels within the 9×9 neighborhood of the position indicated by the current Flow. The sampled cost then passes through the Conv_C1_Relu and Conv_C2_Relu layers. At the same time, the Flow passes through the Conv_F1_Relu and Conv_F2_Relu layers. Their outputs are concatenated and fed to the Conv_CF_Relu layer to obtain motion features. These features, the Static_Guide, and the Flow are concatenated in the Concat1 layer. The critical component, the GRU_Core layer, a separable convolutional GRU layer [12], updates the hidden information h(t) with Concat1. After the GRU_Core layer, the network splits into two branches. The branch beginning with the Conv_FH1_Relu layer updates the current Flow estimation: the Conv_FH2 layer adapts the update information, which is summed with the Flow to obtain Update_Flow. The other branch, which starts with the Conv_M1_Relu and Conv_M2 layers, prepares for up-sampling the optical flow to its full resolution. Through a convex combination [35], the two branches merge and output the optical flow Flow(t+1) and hidden state h(t+1) of the current iteration.

Table 2. The structure of the RUM.

3.3 4D-EM module

When the EGOF-Net receives an F input, the 4D-EM module can directly constrain the search range along the epipolar lines. In Fig. 2, we visualize one slice of the 4D cost as a 3D surface. The sliced cost represents the correlation of one pixel in the left image with all pixels in the right image, and the grid of the horizontal plane coincides with the grid of the right image. Two false matches appear in the matching cost, both with a higher correlation value than the true match. The epipolar geometry can rule out these wrong candidates. Inspired by [13,14], we apply a 1D Gaussian profile along the direction perpendicular to the epipolar line to generate a 2D filter in the sliced matching-cost space. We call this filter the epipolar modulator. Its maximum value is G, and its standard deviation is σ. After an element-wise product between the matching cost and the epipolar modulator, only the candidates near the epipolar line survive; furthermore, the value of the true match becomes higher and sharper due to the amplification by G.

Fig. 2. The working process of the 4D-EM.

The 4D-EM M(u, v, u′, v′) is a 4D tensor whose first two and last two indices represent coordinates in the left and right images, respectively. By Eq. (2), each pixel (u, v) in the left image corresponds to an epipolar line l′ in the right image. This l′ is a function of (u, v), written as l′(u, v), and expands as

$$l^{\prime}(u,v) = {[{\boldsymbol a^{\prime}}(u,v),{\boldsymbol b^{\prime}}(u,v),{\boldsymbol c^{\prime}}(u,v)]^T}, $$
where a′, b′, and c′ are functions of (u, v). The calculation of M(u, v, u′, v′) is
$${\boldsymbol M}(u,v,u^{\prime},v^{\prime}) = G \cdot \exp \left( {\frac{{ - {{|{{\boldsymbol d}(u,v,u^{\prime},v^{\prime})} |}^2}}}{{2{\sigma^2}}}} \right), $$
$${\boldsymbol d}(u,v,u^{\prime},v^{\prime}) = \frac{{|{{\boldsymbol a^{\prime}}(u,v) \cdot u^{\prime} + {\boldsymbol b^{\prime}}(u,v) \cdot v^{\prime} + {\boldsymbol c^{\prime}}(u,v)} |}}{{\sqrt {{{|{{\boldsymbol a^{\prime}}(u,v)} |}^2} + {{|{{\boldsymbol b^{\prime}}(u,v)} |}^2}} }}, $$
where d(u, v, u′, v′) is a 4D tensor holding the distance from pixel (u′, v′) to the epipolar line l′(u, v) in the right image, and G and σ are fixed hyperparameters that control the amplitude and tolerance of the 4D-EM. We study the choice of these two parameters in Section 4.4.

When no fundamental matrix is provided to the EGOF-Net, the 4D-EM generates an all-ones 4D modulator that does not constrain the search direction in the 4D cost volume.
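
A minimal NumPy sketch of Eqs. (4)-(6) is given below. It assumes that the fundamental matrix is expressed in the coordinates of the coarse h×w grid on which the cost volume lives, and it falls back to the all-ones modulator when no F is supplied; the function name and the default G and σ values are illustrative.

import numpy as np

def build_4d_epipolar_modulator(F, h, w, G=1.3, sigma=2.0):
    # Returns M of shape (h, w, h, w); all ones when no F is available.
    if F is None:
        return np.ones((h, w, h, w), dtype=np.float32)
    vs, us = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')    # left grid (v, u)
    x = np.stack([us, vs, np.ones_like(us)])                           # homogeneous left pixels
    a, b, c = np.einsum('ij,jhw->ihw', F, x)                           # l'(u, v) = F x
    vps, ups = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')  # right grid (v', u')
    # Eq. (6): distance from every right pixel (u', v') to the epipolar line of (u, v).
    num = np.abs(a[..., None, None] * ups + b[..., None, None] * vps + c[..., None, None])
    d = num / np.sqrt(a ** 2 + b ** 2)[..., None, None]
    # Eq. (5): Gaussian modulation across the line with amplitude G and tolerance sigma.
    return (G * np.exp(-d ** 2 / (2 * sigma ** 2))).astype(np.float32)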

3.4 DCCM module

The DCCM builds on the observation that VLR and VRL in Fig. 1 are computed independently. In the EGOF-Net, the processing pipelines of inputs 1 and 2 are not symmetric: input 1 passes through an additional CE module that extracts context features. Therefore, when VLR and VRL are consistent for a particular pixel correspondence, there is a greater probability that this match is correct.

The DCCM contains four steps, as shown in Fig. 1: (a) estimate VLR and VRL without any fundamental matrix input; (b) obtain a consistency mask Ξ by cross-checking VLR and VRL; (c) randomly sample correspondences within Ξ; (d) estimate the fundamental matrix F.

Based on Eq. (1), the cross-checking calculation that yields Ξ is

$$[{u^{\prime (d)}},{v^{\prime (d)}}] = {{\boldsymbol V}_{LR}}({u^{(i)}},{v^{(i)}}) + [{u^{(i)}},{v^{(i)}}], $$
$$[{u^{(d)}},{v^{(d)}}] = {{\boldsymbol V}_{RL}}( [\kern-0.15em[ {u^{\prime (d)}} ]\kern-0.15em] , [\kern-0.15em[ {v^{\prime (d)}} ]\kern-0.15em] ) + [ [\kern-0.15em[ {u^{\prime (d)}} ]\kern-0.15em] , [\kern-0.15em[ {v^{\prime (d)}} ]\kern-0.15em] ], $$
$${\mathbf \Xi }({u^{(i)}},{v^{(i)}}) =\{ \sqrt {{{({u^{(d)}} - {u^{(i)}})}^2} + {{({v^{(d)}} - {v^{(i)}})}^2}} < t\}, $$
wherein (u(i), v(i)) is the integer coordinate of a pixel in the left image and (u′(d), v′(d)) is the estimated position of its match in the right image, obtained from VLR. The superscript (d) indicates that the value is a decimal number, and $[\kern-0.15em[{\cdot} ]\kern-0.15em] $ is the rounding operation. The coordinates (u(d), v(d)) come from VRL. If the Euclidean distance from (u(i), v(i)) to (u(d), v(d)) is smaller than a threshold t, this pixel is True in the consistency mask; otherwise, it is False. We choose t = 1.
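
A minimal NumPy sketch of the cross-check in Eqs. (7)-(9), assuming the two flow maps are stored as (H, W, 2) arrays of (Δu, Δv); the function name is illustrative.

import numpy as np

def cross_check_mask(flow_lr, flow_rl, t=1.0):
    H, W = flow_lr.shape[:2]
    vs, us = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    # Eq. (7): left pixel (u, v) -> estimated right position (u', v').
    up = us + flow_lr[..., 0]
    vp = vs + flow_lr[..., 1]
    # Round to integer right coordinates and keep them inside the image.
    ur = np.clip(np.rint(up).astype(int), 0, W - 1)
    vr = np.clip(np.rint(vp).astype(int), 0, H - 1)
    # Eq. (8): map the rounded right position back with the right-to-left flow.
    ub = ur + flow_rl[vr, ur, 0]
    vb = vr + flow_rl[vr, ur, 1]
    # Eq. (9): consistent if the round trip lands within t pixels of the start.
    return np.hypot(ub - us, vb - vs) < t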

Because estimating a fundamental matrix from too many point pairs is inefficient, we randomly sample K = 2000 robust matches within the consistency mask Ξ. The K samples are then fed into the LMedS-based 8-point algorithm. Rewriting Eq. (2) as a linear equation system in the nine unknown elements of F gives

$$\left[ {\begin{array}{ccccccccc} {\begin{array}{c} {{u_1}{{u^{\prime}}_1}}\\ \vdots \\ {{u_K}{{u^{\prime}}_K}} \end{array}}&{\begin{array}{c} {{v_1}{{u^{\prime}}_1}}\\ \vdots \\ {{v_K}{{u^{\prime}}_K}} \end{array}}&{\begin{array}{c} {{{u^{\prime}}_1}}\\ \vdots \\ {{{u^{\prime}}_K}} \end{array}}&{\begin{array}{c} {{u_1}{{v^{\prime}}_1}}\\ \vdots \\ {{u_K}{{v^{\prime}}_K}} \end{array}}&{\begin{array}{c} {{v_1}{{v^{\prime}}_1}}\\ \vdots \\ {{v_K}{{v^{\prime}}_K}} \end{array}}&{\begin{array}{c} {{{v^{\prime}}_1}}\\ \vdots \\ {{{v^{\prime}}_K}} \end{array}}&{\begin{array}{c} {{u_1}}\\ \vdots \\ {{u_K}} \end{array}}&{\begin{array}{c} {{v_1}}\\ \vdots \\ {{v_K}} \end{array}}&{\begin{array}{c} 1\\ \vdots \\ 1 \end{array}} \end{array}} \right] \cdot f = 0, $$
where f = [f11, f12, f13, f21, f22, f23, f31, f32, f33]T is F written as a column vector. The sample count is K > 8, so Eq. (10) is overdetermined. We solve this equation system with the LMedS method, which searches for the F that minimizes the target function [15]:
$$Target({\boldsymbol F} )= \sum\limits_{i = 1}^K {{\omega _i}[{{D^2}({{x^{\prime}}_i},{\boldsymbol F}{x_i}) + {D^2}({x_i},{{\boldsymbol F}^T}{{x^{\prime}}_i})} ]}, $$
$${\omega _i} = \left\{ {\begin{array}{cc} 1&{{D^2}({{x^{\prime}}_i},{\boldsymbol F}{x_i}) + {D^2}({x_i},{{\boldsymbol F}^T}{{x^{\prime}}_i}) \le {{(2.5\hat{\sigma })}^2}}\\ 0&{otherwise} \end{array}} \right., $$
$$\hat{\sigma } = 1.4826[{1 + 5/({K - Q} )} ]\sqrt {\mathop {Median}\limits_{i = 1,\ldots ,K} [{{D^2}({{x^{\prime}}_i},{\boldsymbol F}{x_i}) + {D^2}({x_i},{{\boldsymbol F}^T}{{x^{\prime}}_i})} ]}, $$
where D(x, l) is the function returning the pixel distance from point x to line l, as in Eq. (6), ωi is a binary weight, K is the number of sampled point pairs, and Q = 8 is the minimum number of samples needed to estimate an F. The Levenberg-Marquardt algorithm optimizes the target function in Eq. (11) and finds a robust estimate. For better numerical stability, the coordinates are normalized before solving the equation system and denormalized after the robust optimization [29].
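
As an aside, an equivalent LMedS-based estimate with the same internal coordinate normalization is also available off the shelf in OpenCV. The sketch below uses that routine as a baseline; it is not the authors' implementation of Eqs. (10)-(13).

import numpy as np
import cv2

def estimate_fundamental_lmeds(pts_left, pts_right):
    # pts_left, pts_right: (K, 2) arrays of sampled correspondences, K >= 8.
    F, inliers = cv2.findFundamentalMat(np.float32(pts_left), np.float32(pts_right),
                                        cv2.FM_LMEDS)
    return F, inliers  # F is 3x3 with rank 2; inliers flags the LMedS-consistent pairs.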

3.5 Loss function

We use the L1 norm to supervise the iterative optical flow outputs of the RUM. During training, the fundamental matrix input of the EGOF-Net is left empty. The RUM iterates N times and outputs N optical flow maps [V1, V2, …, VN]. The loss function is

$$loss = \sum\limits_{i = 1}^N g^{N - i}\left( \frac{\sum {\mathbf \Omega} \odot \|{\boldsymbol V}_{gt} - {\boldsymbol V}_i\|_1}{\sum {\mathbf \Omega}} \right), $$
wherein Ω is a binary mask marking the non-occluded area in the ground-truth optical flow map Vgt, ⊙ is the element-wise product operator, and g is the weighting coefficient for the N outputs. The final output VN is the most important, so we choose g < 1. Empirically, we use g = 0.8 and N = 12 in our experiments.
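
A minimal PyTorch sketch of Eq. (14), assuming the network returns the list [V1, ..., VN] of iterative predictions as (B, 2, H, W) tensors and that Ω is supplied as a binary mask of shape (B, 1, H, W); the function name is illustrative.

import torch

def sequence_loss(flow_preds, flow_gt, valid, g=0.8):
    N = len(flow_preds)
    total = 0.0
    for i, flow in enumerate(flow_preds, start=1):
        l1 = (flow_gt - flow).abs().sum(dim=1, keepdim=True)     # per-pixel L1 norm
        masked = (valid * l1).sum() / valid.sum().clamp(min=1)   # average over Omega
        total = total + g ** (N - i) * masked                    # later iterations weigh more
    return total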

4. Experiment

To verify the performance of our method, we test it on different datasets and in an existing system. This section is organized as follows. First, we introduce the training and validation datasets, followed by the implementation details. In the subsequent sub-sections, we conduct ablation studies of the proposed 4D-EM and DCCM, compare the EGOF-Net with other state-of-the-art optical flow networks on both synthetic and real-scene datasets, and apply our network to a long-focal-length, wide-baseline dynamic stereo system that reconstructs 3D point clouds at a long distance.

4.1 Dataset

Our goal is to apply the EGOF-Net in a dynamic stereo system with a wide baseline, which requires the training dataset to cover a wide range of optical flow values. However, the most popular datasets, KITTI-flow [19] and Sintel [20], do not contain large enough pixel displacements. To solve this problem, we augment the FlyingThings-stereo dataset [36] into an unrectified stereo dataset by adding random rotation, translation, and scale transformations. We call it FlyingThings-Ustereo. It has 21818 pairs of unrectified stereo images for training and another 400 pairs for validation. The image resolution is 540×960, and the optical flow values in the horizontal and vertical directions lie in [-512, 512] pixels. A sample of the dataset is given in Fig. 3.
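
The sketch below illustrates one plausible way to build such an unrectified pair, assuming that only the right image is warped by a random similarity transform (rotation, translation, scale) and that the ground-truth flow is recomposed accordingly; the parameter ranges are illustrative and are not the exact settings used to generate FlyingThings-Ustereo.

import numpy as np
import cv2

def unrectify_pair(img_right, flow, max_angle=10, max_shift=40, scale_range=(0.9, 1.1)):
    # flow: (H, W, 2) left-to-right flow. Returns the warped right image, the
    # recomposed flow, and a validity mask for targets that stay inside the image.
    H, W = img_right.shape[:2]
    angle = np.random.uniform(-max_angle, max_angle)
    scale = np.random.uniform(*scale_range)
    A = cv2.getRotationMatrix2D((W / 2, H / 2), angle, scale)        # 2x3 similarity matrix
    A[:, 2] += np.random.uniform(-max_shift, max_shift, size=2)      # random translation
    warped = cv2.warpAffine(img_right, A, (W, H))
    # A left pixel x maps to x + flow in the old right image and to A(x + flow)
    # in the warped one, so the new flow is A(x + flow) - x.
    vs, us = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    tx, ty = us + flow[..., 0], vs + flow[..., 1]
    new_tx = A[0, 0] * tx + A[0, 1] * ty + A[0, 2]
    new_ty = A[1, 0] * tx + A[1, 1] * ty + A[1, 2]
    new_flow = np.stack([new_tx - us, new_ty - vs], axis=-1).astype(np.float32)
    valid = (new_tx >= 0) & (new_tx < W) & (new_ty >= 0) & (new_ty < H)
    return warped, new_flow, valid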

Fig. 3. A sample of the FlyingThings-Ustereo dataset. (a)–(d) left image, optical flow, valid mask of the augmentation, and occlusion mask. (e)–(h) right image, optical flow, valid mask of the augmentation, and occlusion mask. (i) color encoding of the optical flow.

4.2 Implementation details

We implement the EGOF-Net with the PyTorch library. The AdamW solver is used to update the network weights. The gradients are clipped to [-1, 1] because the network is RNN-style. The network is trained on one NVIDIA RTX 3090 GPU. For training, we consecutively use the FlyingChairs, FlyingThings-Flow, Sintel, and FlyingThings-Ustereo datasets. Table 3 gives the hyperparameters of the different training phases.
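
A minimal sketch of this optimization setup, assuming a model egof_net and a dataloader already exist and reusing the sequence_loss sketched in Section 3.5; the learning rate and weight decay here are illustrative, and the actual schedules are listed in Table 3.

import torch

def train_epoch(egof_net, loader, device='cuda'):
    # Illustrative values; see Table 3 for the real per-phase hyperparameters.
    optimizer = torch.optim.AdamW(egof_net.parameters(), lr=1e-4, weight_decay=1e-4)
    for img1, img2, flow_gt, valid in loader:
        flow_preds = egof_net(img1.to(device), img2.to(device))  # list of N iterative outputs
        loss = sequence_loss(flow_preds, flow_gt.to(device), valid.to(device))
        optimizer.zero_grad()
        loss.backward()
        # Clip gradient values to [-1, 1], as is common for RNN-style networks.
        torch.nn.utils.clip_grad_value_(egof_net.parameters(), clip_value=1.0)
        optimizer.step()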

Table 3. Training details

4.3 Epipolar geometry estimation with DCCM

We compare several fundamental matrix estimation methods to find the best one for the EGOF-Net. The accuracy of the methods is measured quantitatively on the validation set of the FlyingThings-Ustereo dataset. For extracting correspondence points, five approaches are tested: (a) the GMS feature [37], (b) the SIFT feature, (c) random sampling from the output of the EGOF-Net, (d) a SIFT mask-based sampling strategy [10] over the network output, and (e) our DCCM. All these correspondence sampling methods are combined with either the RANSAC or the LMedS algorithm. The error metric of the estimated fundamental matrix is the symmetric projection error (SPE) described in [38]. For any matching point pair (u, v) and (u′, v′), we use Eq. (2) to calculate the epipolar line pair l and l′, represented as [a, b, c] and [a′, b′, c′]. The definition of the SPE is

$$SPE = \frac{1}{2} \cdot \left( {\frac{{|{au + bv + c} |}}{{\sqrt {{a^2} + {b^2}} }} + \frac{{|{a^{\prime}u^{\prime} + b^{\prime}v^{\prime} + c^{\prime}} |}}{{\sqrt {{{a^{\prime}}^2} + {{b^{\prime}}^2}} }}} \right). $$
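
A minimal NumPy sketch of Eq. (15), assuming the matched points are given as (K, 2) arrays; averaging the returned per-pair values over a dataset yields numbers comparable to Table 4. The function names are illustrative.

import numpy as np

def point_line_distance(lines, pts):
    # lines: (K, 3) rows [a, b, c]; pts: (K, 2) rows [u, v] -> (K,) distances.
    num = np.abs(lines[:, 0] * pts[:, 0] + lines[:, 1] * pts[:, 1] + lines[:, 2])
    return num / np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)

def symmetric_projection_error(F, pts_left, pts_right):
    xl = np.hstack([pts_left, np.ones((len(pts_left), 1))])    # homogeneous left points
    xr = np.hstack([pts_right, np.ones((len(pts_right), 1))])  # homogeneous right points
    lines_right = xl @ F.T    # l' = F x, epipolar lines in the right image
    lines_left = xr @ F       # l = F^T x', epipolar lines in the left image
    return 0.5 * (point_line_distance(lines_right, pts_right)
                  + point_line_distance(lines_left, pts_left))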

The smaller the SPE, the more accurate the fundamental matrix. The experimental results in Table 4 show that, for the same feature point extraction method, the LMedS algorithm is more accurate than RANSAC, and that the DCCM with LMedS achieves the best result with an average SPE of 0.20 pixels, a substantial margin over the other methods. The visualized results in Fig. 4 are consistent with Table 4: ours with LMedS has the highest accuracy.

Fig. 4. Results of fundamental matrix estimation with different methods. Purple circles mark identical points. The green lines are the ground-truth epipolar lines, and the red lines are the estimated epipolar lines; the less green is visible, the better.

Table 4. The average SPE with different methods

4.4 Ablation study of the 4D-EM module

There are two hyperparameters in the 4D-EM. We study different combinations of the parameters on the FlyingThings-Ustereo dataset. The endpoint error (EPE), the average length of the difference between the ground-truth and estimated flow vectors, is the main metric for optical flow accuracy. In addition, the 3-pixel error rate (3PE), the percentage of pixels whose flow error exceeds 3 pixels, is also used. The lower the EPE and 3PE, the better. The search range of the hyperparameters G and σ is [0.7, 1.0, 1.3, 1.5, 2.0, 5.0, 10.0]. The statistics are shown in Table 5 and Table 6. Theoretically, the G value affects the overall modulation intensity of the 4D-EM, and the σ value affects the width of the modulation band around the epipolar line. In the experiment, from the perspective of EPE, the combination of G = 1.3 and σ = 2.0 performs best, reaching an average error of 0.95 pixels with a 3PE of 4.37%, significantly better than the 1.25-pixel EPE and 5.52% 3PE obtained when the 4D-EM is all ones. From the perspective of 3PE, the combination of G = 1.5 and σ = 0.7 is the best; however, its EPE is slightly worse than the result without the 4D-EM. A possible explanation is that the resolution of the 4D cost is smaller than that of the input image, so a too-small σ may remove details close to the epipolar line at full resolution. In general, most combinations improve the performance, and we choose G = 1.3 and σ = 2.0 in the following experiments.
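
For reference, a minimal NumPy sketch of these two metrics, assuming the flows are (H, W, 2) arrays and an optional boolean mask selects the evaluated pixels; the function name is illustrative.

import numpy as np

def epe_and_3pe(flow_pred, flow_gt, valid=None):
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)   # per-pixel endpoint error
    if valid is not None:
        err = err[valid]
    epe = err.mean()                                     # average endpoint error (pixels)
    three_pe = 100.0 * (err > 3.0).mean()                # % of pixels off by more than 3 px
    return epe, three_pe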

Table 5. EPE with respect to different hyperparameters

Table 6. 3PE with respect to different hyperparameters

4.5 On synthetic dataset

To compare the EGOF-Net with other state-of-the-art networks, we first train the networks with similar protocols and finetune them on the synthetic FlyingThings-Ustereo dataset. Among the candidate reference networks, popular models such as LiteFlowNet3 [24], PWC-Net [25], and IRR-PWC [39] cannot reach the displacement range of our dataset because their fixed architectures restrict the effective receptive field. Only networks that can fit our dataset serve as references: the encoder-decoder style VCN [22] and the iterative RAFT [12] are selected. EPE and 3PE are the metrics for comparison. The statistics in Table 7 show that RAFT is better than VCN; a plausible explanation is that RAFT handles the cost volume at a higher resolution and updates the flow iteratively. Compared to RAFT, our EPE and 3PE are 23% and 20% lower, respectively. As shown in Fig. 5, VCN and RAFT fail in some texture-less areas, whereas ours matches successfully in the same areas. The visualization is consistent with the statistics in Table 7. The experiment shows that the EGOF-Net outperforms the others by a substantial margin.

Fig. 5. Optical flow results of different methods on the FlyingThings-Ustereo dataset. Gray pixels mark where the EPE is greater than 3 pixels; the fewer, the better. Please zoom in to see more details.

Table 7. Correspondence results of different methods

4.6 On the real-scene dataset

To test the network's generalization, we also compare VCN, RAFT, and the EGOF-Net on the Middlebury-Ustereo dataset, which is augmented from the Middlebury stereo dataset [40] with the same method used for FlyingThings-Ustereo. The networks are not trained on this dataset. The statistics and visualizations of the experiment are in Table 8 and Fig. 6. Our method achieves the best EPE and 3PE on all ten image pairs in the dataset. The areas within the red frames in Fig. 6 show the significant improvement of our network over the others; these areas usually lack texture. The experiment shows that our network generalizes well and outperforms the other methods.

Fig. 6. Results on the Middlebury-Ustereo dataset. Gray pixels mark where the EPE is greater than 3 pixels; the fewer, the better. Please zoom in to see more details.

Table 8. Results on Middlebury-Ustereo Dataset

4.7 3D reconstruction with a dynamic stereo system

Beyond the experiments on datasets, we apply our method to a dynamic stereo system with a long focal length and wide baseline. The system includes two Canon 5D Mark III cameras with 1200 mm focal-length lenses. As shown in Fig. 7, the two cameras can rotate individually and shoot the same target at a distance of 80 to 120 m. The baseline can be 15 to 25 m long, which guarantees the depth resolution in 3D space. The EGOF-Net takes the unrectified left and right images and outputs an optical flow map and a fundamental matrix. With the known intrinsic parameters and the fundamental matrix, the camera matrices can be recovered, and the optical flow is projected into 3D space with a standard triangulation algorithm [33].
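
A hedged sketch of this reconstruction step using standard OpenCV routines rather than the authors' code: the essential matrix is formed from F and the intrinsics, the relative pose is recovered, and correspondences taken from the flow map are triangulated. For simplicity it passes a single intrinsic matrix to recoverPose, and the result is defined only up to an unknown scale.

import numpy as np
import cv2

def reconstruct_points(F, K1, K2, pts_left, pts_right):
    # pts_left, pts_right: (N, 2) corresponding pixels sampled from the flow map.
    E = K2.T @ F @ K1                                     # essential matrix from F and intrinsics
    _, R, t, _ = cv2.recoverPose(E, pts_left.astype(float), pts_right.astype(float), K1)
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])    # left camera at the origin
    P2 = K2 @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts_left.T.astype(float),
                                  pts_right.T.astype(float))
    return (pts4d[:3] / pts4d[3]).T                       # (N, 3) points, up to scale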

Fig. 7. The long-focal-length and unconstrained stereo system.

We test our method on two scenes with different camera rotation angles. Scene A in Fig. 8(a) includes three truncated concrete cone models, and scene B in Fig. 9(a) contains a wooden stair, a black toy mountain, and a concrete model. The measuring distance is 100 m, and the baseline is 20 m. The results are shown in Fig. 8 and Fig. 9. From the estimated epipolar lines, one can see that identical points are correctly aligned. In Fig. 8(c) and (d), the corresponding epipolar lines are vertically shifted and have a slight relative rotation angle. In Fig. 9(c), a larger relative rotation angle and a substantial scale difference exist, and our network still works well. The 3D reconstruction results indicate the validity of the estimated optical flows. The sliced views show that the curves are consistent with the models. In the zoomed view in Fig. 8(h), the narrow wave corresponds to the steel ruler pasted on the center concrete model, which shows the fine detail of the 3D model. In Fig. 9(h), one can see that the reconstructed planes of the wooden stair are parallel to each other. These experiments show that the EGOF-Net is effective in such a dynamic stereo system.

Fig. 8. Optical flow and 3D reconstruction of real scene A. (a)–(h): left image, right image, left epipolar lines, right epipolar lines, estimated flow, reconstructed point cloud, point cloud sliced at a horizontal plane passing through the green line in (f), and zoomed view of the purple frame in (g).

Fig. 9. Optical flow and 3D reconstruction of real scene B. (a)–(h): left image, right image, left epipolar lines, right epipolar lines, estimated flow, reconstructed point cloud, point cloud sliced at a vertical plane passing through the green line in (f), and zoomed view of the purple frame in (g).

5. Conclusion

This paper addresses the unrectified stereo matching problem with the EGOF-Net. To take advantage of the epipolar geometry and find correct matches in texture-less areas, we propose the 4D-EM and DCCM modules. With these modules, our network outperforms state-of-the-art optical flow networks on both synthetic and real-scene datasets. We also test the network in a dynamic stereo system with a long focal length and wide baseline; at a distance of 100 m, we successfully reconstruct three texture-less concrete models in 3D. For now, our method deals only with pure image information, and no camera parameters enter the network. The paper shows the feasibility and efficiency of deeply combining heuristic algorithms with the latest neural networks, which suggests a new opportunity for improving neural networks. Incorporating the optical flow, fundamental matrix, and camera parameters into an end-to-end network while keeping good generalization for real problems remains challenging.

Funding

National Natural Science Foundation of China (61535008).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [36,40].

References

1. W. Yin, Y. Hu, S. Feng, L. Huang, Q. Kemao, Q. Chen, and C. Zuo, “Single-shot 3D shape measurement using an end-to-end stereo matching network for speckle projection profilometry,” Opt. Express 29(9), 13388–13407 (2021). [CrossRef]  

2. Y. Wang and X. Wang, “On-line three-dimensional coordinate measurement of dynamic binocular stereo vision based on rotating camera in large FOV,” Opt. Express 29(4), 4986–5005 (2021). [CrossRef]  

3. A. Khan, M. U. K. Khan, and C.-M. Kyung, “Intensity guided cost metric for fast stereo matching under radiometric variations,” Opt. Express 26(4), 4096–4111 (2018). [CrossRef]  

4. Y. Li, B. Ge, Q. Tian, J. Quan, and L. Chen, “Eliminating unbalanced defocus blur with a binocular linkage network,” Appl. Opt. 60(5), 1171–1181 (2021). [CrossRef]  

5. A. Fusiello and L. Irsara, “Quasi-Euclidean uncalibrated epipolar rectification,” in Proceedings of 2008 19th International Conference on Pattern Recognition, (IEEE, 2008), pp. 1–4.

6. L. Kou, K. Yang, L. Luo, Y. Zhang, J. Li, Y. Wang, and L. Xie, “Binocular stereo matching of real scenes based on a convolutional neural network and computer graphics,” Opt. Express 29(17), 26876–26893 (2021). [CrossRef]  

7. P. Yadati and A. M. Namboodiri, “Multiscale two-view stereo using convolutional neural networks for unrectified images,” in Proceedings of 2017 Fifteenth IAPR International Conference on Machine Vision Applications, (IEEE, 2017), pp. 346–349.

8. M. A. Mohamed, M. H. Mirabdollah, and B. Mertsching, “Monocular epipolar constraint for optical flow estimation,” in Proceedings of International Conference on Computer Vision Systems, M. Liu, H. Chen, and M. Vincze, eds. (Springer, 2017), pp. 62–71.

9. Y. Zhong, P. Ji, J. Wang, Y. Dai, and H. Li, “Unsupervised deep epipolar flow for stationary or dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 12095–12104.

10. J. Wang, Y. Zhong, Y. Dai, S. Birchfield, K. Zhang, N. Smolyanskiy, and H. Li, “Deep two-view structure-from-motion revisited,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2021), pp. 8953–8962.

11. B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox, “Demon: depth and motion network for learning monocular stereo,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), pp. 5038–5047.

12. Z. Teed and J. Deng, “RAFT: recurrent all-pairs field transforms for optical flow,” in Proceedings of the European Conference on Computer Vision, (Springer, 2020), pp. 402–419.

13. F. Zhang, Y. Chen, Z. Li, Z. Hong, J. Liu, F. Ma, J. Han, and E. Ding, “Acfnet: attentional class feature network for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (IEEE, 2019), pp. 6798–6807.

14. M. Poggi, D. Pallotti, F. Tosi, and S. Mattoccia, “Guided stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 979–988.

15. Z. Zhang, “Determining the epipolar geometry and its uncertainty: a review,” International Journal of Computer Vision 27(2), 161–195 (1998). [CrossRef]  

16. M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM 24(6), 381–395 (1981). [CrossRef]  

17. J. S. Pérez, E. Meinhardt-Llopis, and G. Facciolo, “TV-L1 Optical Flow Estimation,” Image Processing On Line 3, 137–150 (2013). [CrossRef]  

18. K. Yamaguchi, D. McAllester, and R. Urtasun, “Robust monocular epipolar flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2013), pp. 1862–1869.

19. A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: the KITTI dataset,” The International Journal of Robotics Research 32(11), 1231–1237 (2013). [CrossRef]  

20. D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in Proceedings of the European Conference on Computer Vision, (Springer, 2012), pp. 611–625.

21. J. Wang, Y. Zhong, Y. Dai, K. Zhang, P. Ji, and H. Li, “Displacement-invariant matching cost learning for accurate optical flow estimation,” (2020) https://arxiv.org/abs/2010.14851.

22. G. Yang and D. Ramanan, “Volumetric correspondence networks for optical flow,” in Proceedings of Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, eds. (Curran Associates, Inc., 2019), pp. 794–805.

23. A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: learning optical flow with convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, (IEEE, 2015), pp. 2758–2766.

24. T.-W. Hui and C. C. Loy, “LiteFlowNet3: resolving correspondence ambiguity for more accurate optical flow estimation,” in Proceedings of the European Conference on Computer Vision, (Springer, 2020), pp. 169–184.

25. D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), pp. 8934–8943.

26. A. Eldesokey and M. Felsberg, “Normalized convolution upsampling for refined optical flow estimation,” (2021) https://arxiv.org/abs/2102.06979.

27. Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan, “Mvsnet: depth inference for unstructured multi-view stereo,” in Proceedings of the European Conference on Computer Vision, (Springer, 2018), pp. 767–783.

28. S. Im, H.-G. Jeon, S. Lin, and I. S. Kweon, “Dpsnet: end-to-end deep plane sweep stereo,” (2019) https://arxiv.org/abs/1905.00538.

29. R. I. Hartley, “In defense of the eight-point algorithm,” IEEE Trans. Pattern Anal. Machine Intell. 19(6), 580–593 (1997). [CrossRef]  

30. O. Poursaeed, G. Yang, A. Prakash, Q. Fang, H. Jiang, B. Hariharan, and S. Belongie, “Deep fundamental matrix estimation without correspondences,” in Proceedings of the European Conference on Computer Vision Workshops, (Springer, 2018), pp. 0-0.

31. E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother, “Dsac-differentiable ransac for camera localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), pp. 6684–6692.

32. R. Ranftl and V. Koltun, “Deep fundamental matrix estimation,” in Proceedings of the European Conference on Computer Vision, (Springer, 2018), pp. 284–299.

33. R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2 ed. (Cambridge University, 2004). [CrossRef]  

34. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2016), pp. 770–778.

35. A. Singh, F. Porikli, and N. Ahuja, “Super-resolving noisy images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2014), pp. 2846–2853.

36. N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2016), pp. 4040–4048.

37. J. Bian, W.-Y. Lin, Y. Matsushita, S.-K. Yeung, T.-D. Nguyen, and M.-M. Cheng, “Gms: grid-based motion statistics for fast, ultra-robust feature correspondence,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2017), pp. 4181–4190.

38. J.-W. Bian, Y.-H. Wu, J. Zhao, Y. Liu, L. Zhang, M.-M. Cheng, and I. Reid, “An evaluation of feature matchers for fundamental matrix estimation,” (2019) https://arxiv.org/abs/1908.09474v2.

39. J. Hur and S. Roth, “Iterative residual refinement for joint optical flow and occlusion estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 5754–5763.

40. D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling, “High-resolution stereo datasets with subpixel-accurate ground truth,” in Proceedings of German Conference on Pattern Recognition, (Springer, 2014), pp. 31–42.
