LiDAR-camera system extrinsic calibration by establishing virtual point correspondences from pseudo calibration objects

Pei An; Pei An; Yingshuo Gao; Yingshuo Gao; Tao Ma; Tao Ma; Kun Yu; Kun Yu; Bin Fang; Bin Fang; Jun Zhang; Jun Zhang; Jie Ma; Jie Ma

doi:10.1364/OE.394331

1. Introduction

Multi-sensor optical systems are essential modules in visual applications, such as light field imaging system [1,2], array camera imaging system [3,4], stereo vision measurement system [5], spacecraft optical system [6] and 3D Light Detection And Ranging (LiDAR) and camera (LiDAR-camera) system [7]. Among these optical systems, LiDAR-camera system is rapidly applied in the fields of remote sensing [8] and robotic vision [9]. Based on the distances measured by light beams, LiDAR detects the environment surrounding by means of generating sparse point clouds. Monocular camera captures 2D dense RGB information in its field of view (FOV). By fusing the LiDAR data and imagery, LiDAR-camera system can deal with advanced perception vision tasks, such as 3D object detection [10,11]. However, sensor fusion on LiDAR-camera system requires the rigid transformation of camera and LiDAR coordinate systems. This rigid transformation is defined as the extrinsic parameters of LiDAR-camera system. It can be represented by a rotation matrix $\mathbf {R}$ and a translation vector $T$. $\mathbf {R}$ and $T$ are the extrinsic parameters of LiDAR-camera system. Extrinsic calibration aims to estimate them.

Finding the associated features to establish points correspondences is the key of extrinsic calibration [8,12]. However, for the commodity level LiDAR, the horizontal and vertical angle resolutions have large differences, thus causing the sparsity of LiDAR point cloud. Velodyne-HDL-64E S3 LiDAR is taken as example. Its horizontal angle resolutions is $0.08^\circ$, while its vertical angle resolution is five time larger as $0.4^\circ$. Although corner features and edge features are easily detected in RGB image, it is difficult to extract the corresponding features in the sparse LiDAR point cloud. Hence, it is still a challenge to calibrate the extrinsic parameters of LiDAR-camera system. To establish point correspondences, traditional methods [12–15] attempt to design specific calibration objects that are both easily identified in the LiDAR point cloud and RGB image. However, for the general natural scene, there are no specific calibration objects so that these methods fail to work. Therefore, an extrinsic calibration method without any calibration objects is required.

Motivated by this, we propose the virtual point correspondence at the first time, and then present a calibration method using virtual point correspondences. We find that some of objects in a natural scene can provide calibration information if satisfying three conditions. These objects are regarded as pseudo calibration objects. Virtual point correspondences can be established from 2D box corner points of one pseudo calibration object in RGB image and its corresponding 3D frustum box corner points in point cloud. It is the approximation of the real point correspondence. Our method requires two calibration conditions: (i) pseudo calibration objects and (ii) initial guess of extrinsic parameters. They are easily satisfied in the practical application. In the scheme of our method, we present a automatic pipeline to extract 3D frustum boxes of pseudo calibration objects in the sparse LiDAR points cloud. After that, a geometrical optimization scheme is proposed to estimate the extrinsic parameters with virtual point correspondences. Simulations and dataset experiments show that the presented optimization scheme outperforms current calibration optimization methods. Real data experiments demonstrate that the proposed method achieves more accurate results than state-of-the-art calibration object based method. It is verified that our method can be effectively used for LiDAR-camera system in the practical applications.

The remainder of this paper is organized as follows. We briefly survey the current LiDAR-camera calibration methods in Sec. 2. After that, the proposed virtual point correspondence and the novel calibration method are presented by details in Sec. 3. In Sec. 4, experiment settings and calibration results of simulations, dataset, and real data experiments are presented and discussed. Finally, we conclude our work in Sec. 5.

2. Related works

2.1 Calibration methods

Researchers have recently done lots of works for calibration on LiDAR-camera system. Current LiDAR-camera extrinsic calibration methods can be roughly classified as three categories, such as calibration object based methods [12–16], information fusion based methods [8,17–21], and deep learning based methods [22–24].

With designing calibration objects identifiable both to LiDAR point cloud and RGB image, calibration object based methods attempt to establish point correspondences from the corner points of calibration objects. After that, the calibration problem is converted as a classical pose estimation problem, which can be solved by Perspective-n-Point (PnP) algorithm [25] and Bundle Adjustment (BA) optimization scheme [26]. Calibration objects can be roughly classified into two categories, such as 2D calibration objects (polygonal planar boards [13,14], planar boards with holes [16], hybrid planar boards [12]), and 3D calibration objects (ordinary boxes [15]). Accuracy of calibration object based method is dependent on the number of corner points provided by the calibration objects.

Information fusion based methods estimate the extrinsic parameters by finding the connection between the information of RGB image and LiDAR data. Ishikawa et al. [17] estimated extrinsic parameters by fusing the information of visual odometry and LiDAR odometry. LiDAR odometry is estimated via iterative closet point (ICP) algorithms [27,28]. However, for ICP algorithm, due to the sparsity of LiDAR point cloud, the matched 3D points are difficult to find and even incorrect sometimes, thus causing inaccurate pose estimation results. Caselitz et al. [18] used a visual odometry based system to reconstruct a sparse set of 3D points. These points are matched against the LiDAR point cloud to obtain the extrinsic parameters of the LiDAR-camera system. Inspired by the mutual information (MI) applied in the field of remote sensing, Pandey et al. [19] used MI of RGB image and LiDAR intensity image for extrinsic parameters calibration. Wolcott et al. [20] used normalized MI to improve the accuracy of calibration. However, the sparsity of LiDAR point cloud would cause large errors in the procedure of MI computation. Neubert et al. [21] considered that the features with depth changes are likely to create visual gradients, and they presented a calibration related loss function with considering the similarity between RGB image and LiDAR depth. Zhang et al. [8] extracted the common boundaries of RGB image and projected LiDAR depth for registration.

Inspired by the similarity of the RGB image and the projected LiDAR depth, current deep learning based methods feed RGB image and LiDAR depth into well-designed convolutional neural network (CNN) for extrinsic calibration. RegNet [22] is the first deep CNN to regress the extrinsic parameters of LiDAR-camera system. In RegNet, blocks of Network in Network (NiN) [29] are used to extract the features of RGB image and LiDAR depth. CalibNet [23] is a deep CNN to infer the 6-DoF rigid transformation between LiDAR and camera. In the training procedure, CalibNet is trained to predict extrinsic parameters by maximizing the geometric and photometric consistency of RGB image and LiDAR point cloud. Inspired by the CNN architecture of optical flow estimation method PWC-Net [24], CMR-Net [30] exploits six pyramid convolution layers to compute the feature maps of input images, and then applies two fully connected layers to regress the extrinsic parameters.

2.2 Discussion

Summaries of the discussed methods are listed in Fig. 1. As there are no special designed calibration objects in the practical application, calibration object based method have to extract point correspondences of objects in the natural scene. However, due to the complex shape of objects in the natural scene, accurate point correspondences are difficult to establish from RGB image and LiDAR point cloud. Therefore, calibration object based methods cannot work in online case. Method using odometry fusion requires multi-frame of LiDAR point cloud and RGB images to estimate both LiDAR odometry and visual odometry. It cannot work with only one frame of LiDAR point cloud and RGB image. Besides, the accuracy of visual odometry is sensitive to illumination change. So, method using odometry fusion might be not robust in the actual application. Deep learning based methods run fast in online case. They need sufficient training data to improve their generalization ability. However, before using deep learning based method, calibration CNN needs training. To prepare training dataset, one must collect plenty of LiDAR-camera sensor data with ground truths of calibration results. Hence, it is inconvenient to exploit deep learning based calibration method in the practical situations. Methods using geometrical features also face some challenges in the outdoor scene. For Zhang et al method [8] and MI based methods [19,20], these methods require to match the object contour in RGB image and LiDAR point cloud for calibration. However, due to the sparsity of LiDAR point cloud, object contour in LiDAR point cloud is obscure. And boundaries of some objects in RGB image are sensitive to illumination change. So, method using geometrical features might not be stable in the practical application. Therefore, these methods are still hard to achieve a robust and accurate calibration results. To deal with calibration in the environment without any calibration objects, we present a novel method using virtual point correspondences. Our method belongs to the information fusion based methods. It does not need training, and can be used in cases both of one frame and multi-frame of LiDAR-camera sensor data. It works in both offline and online cases. Compared with current geometrical feature based methods, our method is relatively robust to illumination change, with achieving accurate calibration results.

Fig. 1. Summary of current LiDAR-camera extrinsic calibration methods. Offline case means the laboratory environment with calibration objects. Online case denotes the environment in practical application that has no calibration objects. Training denotes that parameters of the method require training. Multi-frame represents the method needs more than one frames of RGB images and LiDAR data for calibration.

Download Full Size | PDF

3. Proposed method

In this section, definition of virtual point correspondence is illustrated at first. Pose estimation using virtual point correspondences is discussed. Definition of pseudo calibration object, calibration conditions, and calibration scheme are presented later.

3.1 Virtual point correspondence

We briefly introduce the real point correspondence at first. It is used for calibration object based methods. Supposed that a corner point of one calibration object is observed by LiDAR-camera system. Let $P=(x,y,z)^T$ be the position of the corner point in the LiDAR coordinate system. $I=(u,v,1)^T$ is the pixel coordinate of this corner point in the image plane. A real point correspondence of $I$ and $P$ are established using perspective projection [31], as shown in Eq. (1).

(1)$$ZI=\mathbf{K}(\mathbf{R}P+T)$$

(2)$$\mathbf{K}={ \left( \begin{array}{ccc} f_{x} &0 &c_x\\ 0 &f_{y} &c_y\\ 0 &0 &1 \end{array} \right )}$$

where $Z$ is the depth of the corner point. $\mathbf {K}$ is the intrinsic matrix of the camera. $f_x$ and $f_y$ are the focal lengths in the horizontal and vertical image plane axis, respectively. $(c_x,c_y,1)^T$ is the pixel coordinate of the principal point. $\mathbf {R}$ and $T$ are the extrinsic parameters of LiDAR-camera system, which need to be calibrated. For discussion simplicity, a mapping $\pi (\cdot )$ is used to describe the perspective projection in Eq. (1), as $I=\pi (\mathbf {R}P+T)$. Due to the geometry property of man-made calibration object, the corner points are easily detected from both camera and LiDAR sensor.

However, there are no specific calibration objects in the practical application. As shapes of most objects in the natural scene are complex, it is difficult to establish accurate real point correspondences from RGB image and LiDAR point cloud. Therefore, we attempt to present the virtual point correspondence. Theoretically, objects that are fallen into the FOV of LiDAR-camera system can be used to establish virtual point correspondences. We use one of these object $C_i$ for illustration. Supposed that $C_i$ is detected in the RGB image, covered by a 2D bounding rectangular box $RB_i$. $RB_i^j=(u_i^j,v_i^j,1)^T$ is its corner point ($j=1,\ldots ,4$). $C_i$ is also detected in LiDAR point cloud, covered by a 3D bounding frustum box $FB_i$. $FB_i^k=(x_i^k,y_i^k,z_i^k)^T$ is its corner point ($k=1,\ldots ,8$). There are presented in Fig. 2. The virtual point correspondence is established from the 2D box $RB_i$ and the 3D frustum box $FB_i$ if satisfying the condition in Eq. (3). As $RB_i^j$ and $FB_i^j$ do not exist in the real scene, it is noted that virtual point correspondence also does not exist in the real scene, which is the approximation of real point correspondence.

(3)$$RB_i^j = \pi (\mathbf{R}FB_i^j+T) = \pi (\mathbf{R}FB_i^{j+4}+T)$$

Fig. 2. Representation of virtual point correspondences and pose estimation using virtual point correspondences. From the 2D box and its corresponding 3D frustum box, virtual point correspondences are established from $RB_i^j$ to $FB_i^j$, and $RB_i^j$ to $FB_i^{j+4}$ ($j=1,\ldots ,4$). $\mathbf {R}_{est}$ and $T_{est}$ are the extrinsic parameters estimated by our method.

Download Full Size | PDF

In summary, for the situation without specific calibration objects, virtual point correspondence is superior to real point correspondence, for the 2D boxes and 3D frustum boxes of objects are relatively easy to extract [32]. Approach of extracting 2D box and 3D frustum box is discussed with details in Sec. 3.3.

3.2 Pose Estimation using virtual point correspondences

With the virtual point correspondences, a geometrical optimization scheme is proposed to estimate the extrinsic parameters of LiDAR-camera system. It is represented in Fig. 2. Supposed that there are totally $n$ matched 2D boxes $\{RB_i\}_{i=1}^n$ and 3D frustum boxes $\{FB_i\}_{i=1}^n$. In this paper, we present a novel loss function $E(\textbf {R},T)$ in Eq. (4), which focuses on the maximum reprojection error of $FB_i^j$ and $FB_i^{j+4}$. Compared with loss function $E_{BA}(\textbf {R},T)$ [26] in Eq. (5) that focuses on the average reprojection error, the main advantage of $E(\textbf {R},T)$ is that it is the upper bound of $E_{BA}(\textbf {R},T)$. Thus minimizing $E(\textbf {R},T)$ can obtain more accurate extrinsic parameters. Therefore, the proposed loss function is suitable for estimating pose from virtual point correspondences. Experiments in Sec. 4.1.2 also show that $E(\cdot )$ is more suitable than $E_{BA}(\cdot )$ for the virtual point correspondences.

(4)$$\begin{aligned} E(\mathbf{R}, T) = \sum_{i=1}^{n} \sum_{j=1}^{4} \max(\Vert RB_i^j & - \pi(TFB_i^j) \Vert_2^2, \Vert RB_i^j - \pi(TFB_i^{j+4}) \Vert_2^2) \\ TFB_i^j & = \mathbf{R} \cdot FB_i^j + T \end{aligned}$$

(5)$$E_{BA}(\mathbf{R}, T) = \sum_{i=1}^{n} \sum_{j=1}^{4} \Vert RB_i^j - \pi(TFB_i^j) \Vert_2^2/2 + \Vert RB_i^j - \pi(TFB_i^{j+4}) \Vert_2^2/2$$

For the detail of implementation, due to the orthogonal constraints of the rotation matrix $\mathbf {R}$ that $\mathbf {R}^T\mathbf {R} = \mathbf {I}$ and $\det (\mathbf {R})=1$, it is inconvenient to optimize $\mathbf {R}$ directly. $\mathbf {R}$ can be converted into a $3\times 1$ vector $\theta$ with using Rodriguez formula [33]. We concatenate $\theta$ and $T$ to a $6\times 1$ vector $\xi$. Using levenberg-marquardt algorithm [34], $\xi$ is optimized by minimizing function $E(\cdot )$. It is implemented by the least square optimization toolbox of Scipy.

3.3 Extrinsic calibration using virtual point correspondences

3.3.1 Pseudo calibration object and calibration conditions

Theoretically, one object can be regarded as pseudo calibration object if satisfying three conditions: (i) this object is in the FOV of LiDAR-camera system; (ii) this object is not obscured by other objects; (iii) area of this object projected into the image plane is simple-connected. Pseudo calibration objects can be easily found in the practical application, such as car, cyclist, pedestrian, wall, chair, table, and box. Some of them are shown in Fig. 3. Due to its complex shape, pseudo calibration object only provides virtual point correspondences. They are used for the situation without specific calibration objects.

Fig. 3. Flowchart of the proposed calibration method. Using normal guided foreground detection and 2D object detection approaches, matched 2D boxes in RGB image and 3D frustum boxes in LiDAR point cloud are extracted. They provide virtual point correspondences, and are used for extrinsic calibration.

Download Full Size | PDF

Two conditions of the proposed calibration method are discussed: (i) pseudo calibration objects and (ii) initial guess of extrinsic parameters. The first condition is that pseudo calibration objects are required to provide virtual point correspondences. The second condition is that the initial extrinsic parameters $\mathbf {R}_{ini}$, $T_{ini}$ are required to compute a unaligned LiDAR depth. With the initial guess, for one pseudo calibration object, 2D box sizes detected in RGB image and LiDAR depth do not have large differences, thus making the virtual point correspondences robust and accurate. In the practical applications, initial parameters can be estimated with method [13]. Due to the iterative refinement discussed in Sec. 3.3.5, our method can achieve stable and robust calibration performance even if the error of initial parameters is large. Although our method needs two conditions for calibration, they are easily satisfied in the actual applications.

3.3.2 Overview of the proposed method

Our calibration method takes RGB image, LiDAR point cloud and rough extrinsic parameters $\mathbf {R}_{ini}$, $T_{ini}$ as inputs, outputs accurate extrinsic parameters $\mathbf {R}_{est}$, $T_{est}$. Flowchart of the proposed LiDAR-camera extrinsic calibration method is presented in Fig. 3. Using $\mathbf {R}_{ini}$ and $T_{ini}$, the unaligned LiDAR depth is computed via Eq. (1). Our method takes RGB image and unaligned LiDAR depth as inputs, and finds virtual point correspondences from 2D boxes of pseudo calibration objects in RGB image and its corresponding 3D frustum boxes in LiDAR point cloud. After that, extrinsic parameters are optimized from Eq. (4) with virtual point correspondences. Considering the 2D boxes detected in the unaligned LiDAR depth are not accurate enough, iterative refinement is exploited to make calibration results more precise and robust.

3.3.3 Pseudo calibration object detection from LiDAR depth

3D frustum box of pseudo calibration object is required to establish virtual point correspondence. We propose a automatic and multi-stage approach to extract 3D frustum box of pseudo calibration objects from the un-aligned LiDAR depth. Due to the mechanism of LiDAR, point cloud of one object is sparser if it is farther away from LiDAR. Although the range of LiDAR is above 100m, point cloud of the object far from the sensor is low-resolution, which cannot be used for calibration. Therefore, only the pseudo calibration objects in the foreground are used for calibration. The proposed detection method is divided into two parts: (i) depth completion and foreground segmentation and (ii) normal guided foreground detection.

3.3.3.1 Depth completion and foreground segmentation

Approach of depth completion and foreground segmentation is represented in Fig. 4. Considering dense depth is better for object detection, we exploit a depth completion approach inspired by method [35]. Let $W \times H$ matrix $\mathbf {D}_{raw}$ be the sparse LiDAR depth. $W$ and $H$ are width and height of the image, respectively. To preserve the depth information of foreground objects, we use Eq. (6) to invert the depth to obtain $\mathbf {D}_{raw}^{inv}$.

(6)$$\mathbf{D}_{raw}^{inv}(i,j) = d_{max} - \mathbf{D}_{raw}(i,j)$$

where $i$ and $j$ are the row and column pixel index, respectively. $d_{max}$ is the max value in $\mathbf {D}_{raw}$. $\mathbf {D}_{raw}^{inv}$ gets denser after dilate operation. It is then smoothed by median blur filter, and bilateral filter. Dense depth $\mathbf {D}_{den}$ is obtained by inverting $\mathbf {D}_{raw}^{inv}$ again using Eq. (6). After that, we use Euclidean cluster method to split $\mathbf {D}_{den}$ into two clusters $\mathbf {D}_{c1}$ and $\mathbf {D}_{c2}$. Average depths of two clusters are computed as $\bar {d}_{c1}$ and $\bar {d}_{c2}$, respectively. Foreground depth $\mathbf {D}_f$ is determined as the cluster with the minimal average depth.

Fig. 4. Flowchart of depth completion and foreground segmentation.

Download Full Size | PDF

3.3.3.2 Normal guided foreground detection

Approach of normal guided foreground detection is presented in Fig. 5. For discussion simplicity, it can be divided into two sub-procedures: (1) cluster and (2) detection. In the first sub-procedure, we aim to extract the clusters of foreground objects. Ground points need to be filtered in advance. Traditional ground estimation method [36] is iterative and runs on CPU. It requires many hyper-parameters, not robust to detecting the ground in the unaligned depth. We propose a simple method to filter ground points with the guidance of normal. It requires fewer hyper-parameters, and is accelerated by graphics process unit (GPU). $\mathbf {D}_{f}$ is used to compute the foreground normal $\mathbf {N}_{f}$. A fast and robust approach of computing $\mathbf {N}_{f}$ is presented in Fig. 6. It is modified by method [37]. In order to estimate the normal vector of the i-th pixel, a $3 \times 3$ local window is used to obtain pixels $A_i$ and $B_i$ ($i=1,2,3$). Back-projecting these pixels with given depths in $\mathbf {D}_{f}$ obtains the corresponding points $P_{A}^i$ and $P_{B}^{i}$ via Eq. (7).

(7)$$z=d, x=\frac{d(u-u_0)}{f_u}, y=\frac{d(v-v_0)}{f_v}$$

where $d$ is the pixel depth. $u$ and $v$ are the pixel coordinates. $(x,y,z)^T$ are the coordinates of the corresponding point. From Fig. 6, the normal of the i-th pixel is approximately regarded as the normal of the neighbored triangular, such as $\triangle P_{A}^1P_{A}^2P_{A}^3$ and $\triangle P_{B}^1P_{B}^2P_{B}^3$, marked as $n_p^A$ and $n_p^B$, respectively. For example, $n_p^A$ is computed via Eq. (8). Due to the measurement noise, $n_p^A$ and $n_p^B$ might not be equal. We estimate the i-th pixel normal vector $n_p$ as the average $(n_p^A+n_p^B)/\Vert n_p^A+n_p^B \Vert _2$. The procedure of computing $n_p$ is accelerated by GPU.

(8)$$n_p^A = \frac{\overrightarrow{P_A^1P_A^2} \times \overrightarrow{P_A^1P_A^3}}{\Vert \overrightarrow{P_A^1P_A^2} \times \overrightarrow{P_A^1P_A^3} \Vert_2}$$

$\mathbf {N}_{f}$ is used to remove the ground points in foreground scene. As normals of ground points are nearly vertical to the ground, ground points can be removed by setting a threshold $n_{th}=(n_{x,th}, n_{y,th}, n_{z,th})^T$. It is also accelerated by GPU. Filtered foreground normal $\mathbf {N}_{df}$ is obtained using the removal criteria shown in Eq. (9). After that, $\mathbf {N}_{df}$ is processed by dilate operation, and smoothed by median blur algorithm. $\mathbf {N}_{df}$ is used to cluster foreground objects, for the ground points and background objects are all removed in previous procedures. Common cluster method, such as Euclidean cluster method, is applied to extract clusters $\{C_i\}_{i=1}^{k}$ from image $\mathbf {N}_{df}$. $C_{i}$ is the pixel sets that belongs to i-th class. $k$ is the sum of clusters.

(9)$$\mathbf{N}_{df} = \{ n_{i}=(n_{x,i},n_{y,i},n_{z,i})^T \in \mathbf{N}_{f} | \vert n_{x,i} \vert \leq n_{x,th}, \vert n_{y,i} \vert \leq n_{y,th}, \vert n_{z,i} \vert \geq n_{z,th}\}$$

Fig. 5. Flowchart of normal guided foreground detection.

Download Full Size | PDF

Fig. 6. Representation of fast and robust normal computation. (a) Local 3$\times$3 pixel patch of the i-th pixel. (b) Normal vector of the i-th pixel is the average of the normals of neighbored triangulares.

Download Full Size | PDF

In the second sub-procedure, we aim to extract the 2D boxes and 3D frustum boxes from cluster sets $\{C_i\}_{i=1}^{k}$. Let $u_{max}^i$, $u_{min}^i$, $v_{max}^i$, and $v_{min}^i$ be the maximum and minimum pixels of $C_i$ in u-axis and v-axis, respectively. A bounding rectangular box $BB_{i}$ is obtained on $C_i$. Four corners points $\{BB_{i}^{j}\}_{j=1}^4$ are presented in Eq. (10). Considering that $\mathbf {N}_{df}$ might loss edge information, $BB_{i}$ is not accurate as shown in Fig. 7. In the procedure of 2D box refinement, $BB_{i}$ is refined with the help of $\mathbf {D}_{f}$. We search the area $S_i=\{(u,v,1)^T| u_{min}^i \leq u \leq u_{max}^i, v \leq v_{min}^i\}$ in $\mathbf {D}_{f}$, and find the pixel $I_i^*=(u_i^*,v_i^*,1)^T$ that has depth and minimal v-axis value. $BB_{i}^{3}$ and $BB_{i}^{4}$ is refined by updating $v_*^i$ to $v_{min}^i$ in Eq. (10). Refined result is presented in Fig. 7. After that, 3D frustum box $FB_i$ is extracted from $BB_i$. Let $\{FB_i^k\}_{k=1}^8$ be its corner point. According to foreground depth $\mathbf {D}_{f}$, the depth range $[d_{min}^i, d_{max}^i]$ of cluster $C_i$ can be determined. Back-projecting $BB_i^j$ ($j=1,\ldots ,4$) with $d_{min}^i$ and $d_{max}^i$ via Eq. (7) can obtain $FB_i^{j}$ and $FB_i^{j+4}$.

(10)$$\begin{aligned} BB_{i}^{1}=(u_{min}^i, v_{max}^i, 1)^T, BB_{i}^{2}=(u_{max}^i, v_{max}^i, 1)^T \\ BB_{i}^{3}=(u_{min}^i, v_{min}^i, 1)^T, BB_{i}^{4}=(u_{max}^i, v_{min}^i, 1)^T \end{aligned}$$

Fig. 7. Example of refining 2D box of one cluster.

Download Full Size | PDF

3.3.4 Virtual point correspondences establishment

From Fig. 2, 2D box of corresponding object $RB_i$ in RGB image is required to establish virtual point correspondence. Two common approaches of extracting 2D boxes of objects can be exploited in the proposed calibration framework. The first approach is to use learning based 2D object detection method, such as Yolo v3 [38]. It is automatic. However, learning method requires sufficient training dataset to improve their generalization ability. Output of learning method might not be stable and accurate in various calibration scenes. The second approach is to extract 2D boxes of objects with the aid of human using an interactive annotation approach [39]. Although this approach costs time, it is robust and precise at different calibration situations. Therefore, the second approach is applied in our method to extract 2D boxes $\{PB_j\}_{j=1}^{m}$ in RGB image. $m$ is the number of detected objects.

To establish virtual point correspondences, the next problem is to find $\{RB_i\}_{i=1}^{k}$ from $\{PB_j\}_{j=1}^{m}$. As LiDAR depth is computed by initial pose $\mathbf {R}_{ini}$ and $T_{ini}$, the boxes sizes of the same object detected in LiDAR depth and RGB image do not have large differences. Therefore, this problem can be solved by matching $\{BB_i\}_{i=1}^{k}$ and $\{PB_j\}_{j=1}^{m}$. A metric function $\phi (\cdot )$ of two boxes is defined as Eq. (11). $l(\cdot )$ and $w(\cdot )$ are the length and width of the rectangular 2D box, respectively. The procedure of finding $RB_i$ is discussed in the following. Among $\{PB_j\}_{j=1}^{m}$, $PB_k$ is selected as the best match of $BB_i$ if satisfying Eq. (12) and $\phi (PB_k, BB_i) \leq \phi _{th}$. $\phi _{th}$ is a matching threshold, which can be set as 50.0 pixels. After that, $RB_i$ is set as $PB_k$. To prevent multi-to-one matching situation, only the match with the lowest $\phi (\cdot )$ is recognized as best match. After extracting $\{RB_i\}_{i=1}^k$ and $\{FB_i\}_{i=1}^k$, the virtual point correspondences are established as shown in Fig. 2. Then $\mathbf {R}_{opt}$ and $T_{opt}$ are obtained by minimizing $F(\cdot )$ in Eq. (4). Extrinsic parameters $\mathbf {R}_{est}$ and $T_{est}$ are obtained in Eq. (13).

(11)$$\phi(PB_i, BB_j) = \vert l(PB_i) - l(BB_j) \vert + \vert w(PB_i) - w(BB_j) \vert$$

(12)$$PB_k = \arg\min_{PB_j} \phi(PB_j, BB_i)$$

(13)$${\left( \begin{array}{cc} \mathbf{R}_{est} &T_{est} \\ 0 &1 \\ \end{array} \right )} = {\left( \begin{array}{cc} \mathbf{R}_{opt} &T_{opt} \\ 0 &1 \\ \end{array} \right )} {\left( \begin{array}{cc} \mathbf{R}_{ini} &T_{ini} \\ 0 &1 \\ \end{array} \right )}$$

3.3.5 Iterative refinement

Iterative refinement is widely applied in deep learning based calibration methods [23,30]. It is also used for the proposed method. As the 2D boxes detected in the unaligned LiDAR depth are not accurate, $\mathbf {R}_{est}$ and $T_{est}$ are also not precise enough. However, $\mathbf {R}_{est}$ and $T_{est}$ can be used to compute a relatively aligned LiDAR depth $\textbf {D}_{ali}$ via Eq. (1). Replacing $\textbf {D}_{raw}$ with $\textbf {D}_{ali}$, new virtual point correspondences are found via the previously discussed procedures in Sec. 3.3.3 and 3.3.4. After that, $\Delta \mathbf {R}_1$ and $\Delta T_1$ are computed with the new virtual point correspondences via minimizing Eq. (4). Theoretically, this procedure can be iterated at $t$ times. After that, the refined extrinsic parameters $\mathbf {R}_{ref}$ and $T_{ref}$ are computed in Eq. (14). For the practical applications, if the errors of $\mathbf {R}_{ini}$ and $T_{ini}$ are not large (e.g. rotation error smaller than 20$^\circ$ and position error smaller than 0.2m), we can set $t=1$ for the iterative refinement.

(14)$${\left( \begin{array}{cc} \mathbf{R}_{ref} &T_{ref} \\ 0 &1 \\ \end{array} \right )} = \prod_{i=1}^t{\left( \begin{array}{cc} \Delta\mathbf{R}_i &\Delta T_i \\ 0 &1 \\ \end{array} \right )} {\left( \begin{array}{cc} \mathbf{R}_{est} &T_{est} \\ 0 &1 \\ \end{array} \right )}$$

4. Experiments

4.1 Simulation

4.1.1 Experimental settings

Simulating experiments are conducted to evaluate the performance of pose estimation using virtual point correspondences. Virtual LiDAR-camera system is set up. Intrinsic parameters of the camera is shown in Table 1. Extrinsic parameters are $\mathbf {R}_{tru}$ and $T_{tru}$, respectively. There are $N_{obj}$ pseudo calibration objects in the simulation. Matched $\{RB_i\}_{i=1}^N$ and $\{FB_i\}_{i=1}^{N}$ are found in the simulation environment. Outputs of our method are $\mathbf {R}_{est}$ and $T_{est}$, respectively. In the following experiments, we evaluate the calibration results by measuring the rotation error $E_{rot}$, the translation error $E_{trans}$, and the mean reprojection error $E_{proj}$, as shown in Eqs. (15) and (16). $g(\cdot )$ converts a rotation matrix $\mathbf {R}$ to a angle vector $(\theta _x,\theta _y,\theta _z)^T$, which is discussed in Appendix. A.

(15)$$\begin{aligned} E_{rot} = \Vert g(\mathbf{R}_{tru}^T \mathbf{R}_{est}) \Vert_2 \\ E_{trans} = \Vert T_{tru} - T_{est}\Vert_2 \\ \end{aligned}$$

(16)$$E_{proj} = \frac{1}{8N} \sum_{i=1}^{N} \sum_{j=1}^{4} \Vert RB_i^j - \pi(\mathbf{R}_{est}FB_i^j+T_{est})\Vert_2 + \Vert RB_i^j - \pi(\mathbf{R}_{est}FB_i^{j+4}+T_{est})\Vert_2$$

Table 1. Intrinsic parameters of the virtual camera.

View Table | View all tables in this article

4.1.2 Performance with respect to pixel noise

This experiment investigates the performance with respect to the pixel noise. In the actual application, the 2D boxes detected in the LiDAR depth and the RGB image are not accurate enough. To simulate the practical situation, Gaussian pixel noises with zero mean and $\delta$ standard deviation are added on corner points in $\{RB_i\}_{i=1}^N$ and $\{BB_i\}_{i=1}^N$, respectively. $\delta$ varies from $[0.1, 0.5]$. For each $\delta$, 500 independent trials are preformed, and the average errors are shown on Fig. 8. With the fixed $N_{obj}$, it is found that the calibration result is more accurate when noise level $\delta$ is less. Therefore, for the better calibration results, it is necessary to extract accurate 2D boxes of pseudo calibration objects from the LiDAR depth and the RGB image. There are two tips to improve the accuracy of extracted 2D box from the LiDAR depth and the RGB image. The first tip is to make sure that the pseudo calibration object is close to LiDAR-camera system. If the pseudo calibration object is far from LiDAR-camera system, its point cloud would be very sparse, thus making it difficult to extract its accurate 2D box from LiDAR depth. The second tip is to use object with simple shape as pseudo calibration object. If the shape of one object is complex enough, it is hard to extract its 2D box from both LiDAR depth and RGB image.

Fig. 8. Results of average errors $E_{rot}$, $E_{trans}$, and $E_{proj}$ at different noise level $\delta$ and different numbers of objects $N_{obj}$.

Download Full Size | PDF

4.1.3 Performance with respect to object numbers

This experiment investigates the performance with respect to the number of detected pseudo calibration objects. As the number of pseudo calibration objects is limited in the practical application, it is essential to study the relation between calibration precision and numbers of objects $N_{obj}$. $N_{obj}$ varies from $[1, 10]$. From Fig. 8, tendencies of average error curves show that the accuracy of the calibration result is improved with more pseudo calibration objects. The reason is that the more objects detected, the more virtual point correspondences are provided for calibration, thus enhancing the robustness to pixel noise. It is also found that $E_{rot}$, $E_{trans}$, and $E_{proj}$ are decreased relatively slightly when $N_{obj} \geq 6$. Besides, the number of detected pseudo calibration objects in the natural scene is also limited. Therefore, for the actual application, using $N_{obj} \in [3,6]$ for calibration can obtain accurate enough extrinsic parameters.

4.1.4 Performance with respect to optimization schemes

This experiment investigates the performance with respect to different optimization schemes. The proposed optimization scheme is compared with the common approaches, such as EPnP [25], BA [26], and DLT [40]. Feeding the same virtual point correspondences, these methods are all evaluated at the same condition: $N_{obj}=5$ and $\delta =0.25$. 500 independent trials are preformed, and the distributions of all errors are presented in Fig. 9. Error distribution is computed via Gaussian kernel based distribution estimation method. The averages and deviations of $E_{rot}$ and $E_{trans}$ of our method are obviously smaller than other methods. Besides, we also evaluate these methods in a strict condition: $N_{obj}=2$ and $\delta =0.5$. Results are shown in Fig. 10. The average error of $E_{trans}$ and $E_{proj}$ of our method are slightly smaller than BA method. As for EPnP and DLT, these methods do not consider any optimization schemes to minimize the reprojection errors, thus causing them not robust to pixel noise. So, extrinsic parameter errors and reprojection errors of these method are larger than optimization methods, such as the proposed method and BA method. As discussed in Sec. 3.2, the loss function of the proposed optimization scheme is the upper bounder of the loss function in BA method. The proposed optimization scheme can obtain more accurate extrinsic parameters than BA method. Therefore, the proposed optimization scheme can estimate the robust and accurate calibration result with using the virtual point correspondences.

Fig. 9. Distributions of errors $E_{rot}$, $E_{trans}$, and $E_{proj}$ computed by different methods at the same noise level $\delta =0.25$ and numbers of objects $N_{obj}=5$.

Download Full Size | PDF

Fig. 10. Distributions of errors $E_{rot}$, $E_{trans}$, and $E_{proj}$ computed by different methods at the same noise level $\delta =0.50$ and numbers of objects $N_{obj}=2$.

Download Full Size | PDF

4.1.5 Performance with respect to different extrinsic parameters

This experiment investigates the performance with respect to different extrinsic parameters. $\mathbf {R}_{truth}$ can be represented by a angle vector $(\theta _x, \theta _y, \theta _z)^T$. $T_{truth}=(x,y,z)^T$. These angles and coordinates are all uniformly sampled from the interval $[-90^\circ , 90^\circ ]$ and [−10.0m, 10.0m], respectively. Under the normal and strict conditions, 1000 independent trials are preformed. Distribution of $E_{rot}$ and $E_{trans}$ are presented in Fig. 11. For the normal condition that $\delta =0.25$, $N_{obj}=4$, errors are located in the interval $[0^\circ , 0.03^\circ ]\times [0cm,0.6cm]$ for above 70$\%$ trails. As for the strict condition that $\delta =0.50$, $N_{obj}=2$, interval $[0^\circ , 0.15^\circ ]\times [0cm,4cm]$ contains above 70$\%$ trails. Therefore, the proposed method is robust to different extrinsic parameters in a large range.

Fig. 11. Distributions of errors $E_{rot}$, $E_{trans}$ at different extrinsic parameters under the condition that $\delta =0.25$, $N_{obj}=4$ (Left Image) and $\delta =0.50$, $N_{obj}=2$ (Right Image).

Download Full Size | PDF

4.2 Dataset experiments

4.2.1 Experimental setting

Dataset experiments are conducted to evaluate the performance of proposed calibration method in the real dataset. KITTI dataset [41] is the known dataset in the field of autonomous driving, for it has a large number of sequences with good scene variation. In dataset experiments, With sufficient training dataset on KITTI benchmark, an efficient model [38] is used to extract 2D boxes in RGB image with accuracy. The recording platform is equipped with four high resolution cameras (two gray cameras and two RGB cameras), a Velodyne LiDAR, and a state-of-the-art localization system. We aim to calibrate the extrinsic parameters of the LiDAR and the left RGB camera. Intrinsic parameters of the camera is shown in Table 2. Its resolution is 1242 pixel $\times$ 375 pixel. The ground truths of the extrinsic parameters are provided in KITTI dataset, marked as $\mathbf {R}_{tru}$ and $T_{tru}$. We use raw recordings in KITTI dataset, such as the RGB image and LiDAR point cloud. To evaluate the proposed method, rough extrinsic parameters $\mathbf {R}_{ini}$ and $T_{ini}$ are provided by applying a random noised rigid body transformation $\mathbf {R}_{noi}$ and $T_{noi}$ to the ground truth extrinsic parameters, as shown in Eq. (17). As the same as Sec.4.1.5, angles and coordinates in $\mathbf {R}_{noi}$ and $T_{noi}$ are all uniformly sampled from the interval $[-\theta , \theta ]$ (Unit: degree) and [-D, D] (Unit: meter), respectively. Error metrics, such as $E_{rot}$ and $E_{trans}$ presented in Eq. (15), are also applied in the following experiments.

(17)$$\begin{aligned} \mathbf{R}_{ini} &= \mathbf{R}_{noi} \mathbf{R}_{tru} \\ T_{ini} &= \mathbf{R}_{noi}T_{tru} + T_{noi}\\ \end{aligned}$$

Table 2. Intrinsic parameters of the left RGB camera in KITTI dataset.

View Table | View all tables in this article

4.2.2 Performance with respect to pseudo calibration object detection from LiDAR depth

This experiment investigates the performance with respect to pseudo calibration object detection from LiDAR depth. Proposed pseudo calibration object detection method is multi-stage. Foreground depth and normal are intermediate results used for detection, which are presented in Fig. 12. After obtaining 2D boxes of pseudo calibration objects from LiDAR depth, our method extracts 3D frustum boxes of the corresponding objects. Results are also shown in Fig. 12. The accuracy of calibration result is depended on the precision of 2D boxes detected in LiDAR depth. It can be found that 2D box detected from LiDAR depth covers the pseudo calibration object tightly. It means that the virtual point correspondence is established with accuracy. However, there also exist some situations that the proposed pseudo calibration object detection method performs not well and even gets wrong results, as presented in Fig. 13. They can be roughly divided into two cases: objects overlap and depth missing. Objects overlap means that objects are overlapped in the RGB image and LiDAR depth. The normal of foreground scene is so complex that our method fails to obtain correct numbers of clusters. Due to the mechanism of LiDAR, the light beam generated by LiDAR is not reflected when it hits a transparent object. LiDAR fails to generate the point cloud of transparent objects, such as the car windows. Depths and normals of transparent objects are both missing in the LiDAR depth, making our method difficult to detect pseudo calibration objects. Therefore, it is recommended to calibrate the LiDAR-camera system in the simple situation that has no object overlaps and missing depths.

Fig. 12. Results of proposed pseudo calibration object detection method at different scenes in KITTI dataset.

Download Full Size | PDF

Fig. 13. Wrong results of proposed pseudo calibration object detection method at two situations: objects overlap and depth missing.

Download Full Size | PDF

4.2.3 Performance with respect to calibration accuracy

This experiment investigates the performance with respect to calibration accuracy on the KITTI benchmark dataset. It is noted that there exists no previous method that uses the virtual point correspondence for calibration. For fair model comparison, the common pose estimation schemes, such as EPnP [25], BA [26], and DLT [40], and our method take the same virtual point correspondences as inputs to estimate extrinsic parameters. Considering the error between the initial rough pose and the ground truth pose, $\theta$ and $D$ can be set as $10^\circ$ and $1.0m$, respectively. Due to the discussion in Sec. 3.3.5, we set $t=1$ for the iterative refinement. Results are shown in Table 3. Combining the simulation results in Fig. 8, it is verified again that $N_{obj} \geq 2$ can get accurate calibration result, for the virtual point correspondences of more objects are robust to pixel errors in 2D box detection. From Table 3, using the iterative refinement and the proposed optimization scheme for virtual point correspondences, accuracy of the extrinsic parameters are improved.

Table 3. $E_{rot}$ and $E_{trans}$ of different methods with different object number $N_{obj}$ in KITTI validation dataset. "Iterative" and "Raw" means the proposed calibration with iterative refinement or not.

View Table | View all tables in this article

Moreover, the proposed method is also compared with current deep learning based calibration methods, such as CalibNet [23] and CMR-Net [30], on KITTI validation dataset. These methods do not require any calibration objects, and are trained on KITTI training dataset. Results are presented on Table 4. Our method outperforms CalibNet and CMR-Net. Main reason is that most of pixels in LiDAR depth have missing depths, thus making CNNs difficult to extract crucial features in LiDAR depth. The generalization ability of these learning based calibration method is limited. Hence, the proposed method achieves robust calibration results in practical applications.

Table 4. $E_{rot}$ and $E_{trans}$ of our method and deep learning based method in KITTI validation dataset.

View Table | View all tables in this article

4.2.4 Performance with respect to time consuming

This experiment investigates the performance with respect to time consuming. The proposed calibration method is multi-stage. It is essential to evaluate the runtime performance in the practical application. Average time consuming results of all modules in the proposed method are presented in Fig. 14. It is implemented with Python 3.7 on an Intel i7-4810MQ 2.80GHz CPU, Quadro K2100M GPU, 16.0GB memory Windows 2012 64-bit operating system, of a ThinkPad workstation laptop. Total runtime is nearly 1.35 seconds, while the proposed optimization scheme costs 61.8% of runtime. For the LiDAR-camera system in practical application, as the extrinsic parameters basically do not change at a long time, it is considered that the proposed method is suitable for LiDAR-camera extrinsic calibration.

Fig. 14. Results of runtime of all modules in the proposed calibration method.

Download Full Size | PDF

4.3 Real data experiments

4.3.1 Experimental setting

Real data experiments are mainly conducted to evaluate the performance of proposed method and the calibration object based method in the laboratory environment. LiDAR-camera system consists of a Velodyne-HDL-64E S3 LiDAR and Kinect v2 camera with a resolution of 1920 pixels $\times$ 1080 pixels. They are presented in Fig. 15. The intrinsic parameters of the camera are shown in Table. 5. Raw data of LiDAR point cloud and RGB images is recorded for calibration and test, as presented in Fig. 16. In the scenes for calibration, there are totally five objects which can be used for pseudo calibration objects. Objects 1 to 3 are planar boards with known sizes. Corner points of them can provide real point correspondences. Therefore, objects 1 to 3 can be used as calibration objects for calibration object based method. In the laboratory environment, the 2D boxes of pseudo calibration objects can be obtained accurately with the aid of human operation [39].

Fig. 15. LiDAR-camera system in real data experiments. (a) Velodyne-HDL-64E S3 LiDAR (b) Kinect v2 Camera.

Download Full Size | PDF

Fig. 16. Calibration scenes and test scenes captured by the LiDAR-camera system. Five objects are used in calibration scenes. Without accurate extrinsic parameters, the LiDAR depth is unaligned with the RGB image.

Download Full Size | PDF

Table 5. Intrinsic parameters of the RGB camera in real data experiments.

View Table | View all tables in this article

4.3.2 Performance with respect to calibration accuracy

This experiment investigates the performance of the proposed calibration method. For comparison, calibration object based methods Park et al [13] and Dhall et al [14] are applied in this experiment. They extract the corner points of calibration objects, such as planar rectangular objects 1 to 3 in Fig. 16. Each planar rectangular board provide four real point correspondences. Park et al method [13] establishes 3D-2D point correspondence from calibration objects. With using the depth image generated by Kinect v2 camera, Dhall et al [14] establishes 3D-3D point correspondence from calibration objects. The proposed method requires a initial guess of extrinsic parameters. It extracts the virtual point correspondences of objects 1 to 5. In the initial guess, $\mathbf {R}_{ini}$ is roughly estimated from the LiDAR coordinate system and the camera coordinate system. $\mathbf {T}_{ini}$ is set as $(0,0,0)^T$. Due to the discussion in Sec. 3.3.5, as the distance of two sensors is smaller than 20cm, we set $t=1$ for the iterative refinement in the proposed method. The camera position in the LiDAR coordinate system can be accurately measured. We evaluate the performance of each methods via computing the absolute errors of camera position at X-axis, Y-axis, and Z-axis, marked them as $E_x$, $E_y$, and $E_z$, respectively. Results of all methods are presented in Table 6. It is found that the proposed method achieves the more accurate camera position than Park et al method. The reason is that our method can utilize more objects to provide more point correspondences, e.g. five objects provide forty virtual point correspondences. Using plenty of point correspondences, our method is robust to the measurement noise in pixel and point cloud. Calibration object based methods used a limited number of calibration objects to establish real point correspondences, thus causing that their calibration results are sensitive to the measurement noise.

Table 6. Results and object usage of different methods in real data experiments. Number of corr. means the number of point correspondences that the method utilizes.

View Table | View all tables in this article

From Table 6, it can be found that our method using three frames of LiDAR-camera data achieves more accurate extrinsic parameters than only using one frame. Reason is that there are more pseudo calibration objects in multi-frame of LiDAR-camera data. It provides more virtual point correspondences, thus leading precise calibration results. Therefore, using multi-frame of LiDAR-camera data can improve the accuracy of the proposed method.

The visual calibration results of the proposed method are presented in Figs. 17 and 18. Using $\mathbf {R}_{ini}$ and $T_{ini}$, the unaligned LiDAR depths of three calibration scenes are obtained. In the unaligned LiDAR depth, 2D boxes of pseudo calibration objects are extracted to obtain corresponding 3D frustums boxes. 3D frustums boxes and 2D boxes of pseudo calibration objects in RGB image are applied to establish virtual point correspondences. After estimating the extrinsic parameters, these 3D frustum boxes are projected to the RGB image to compute reprojection error. From Fig. 17, projected 3D frustum boxes and 2D RGB boxes almost coincide. Average reprojection error is 2.20 pixels. From the three calibration scenes, calibrated LiDAR depths are basically aligned with RGB images. It demonstrates that the proposed method is accurate.

Fig. 17. Calibration results of the proposed method in three calibration scenes. 2D boxes of pseudo calibration objects in the unaligned depths are presented in the first row. Projected 3D frustums boxes (Blue line) and 2D boxes (Red line) of pseudo calibration objects in the RGB image are presented in the second row, which is regarded as visual reprojection error. Calibrated LiDAR depths are presented in the third row.

Download Full Size | PDF

Fig. 18. 3D frustums boxes of objects 1 to 5 in three calibration scenes.

Download Full Size | PDF

4.3.3 Visual results

For the three test scenes, the visual results of Park et al [13] method and our methods are presented on Fig. 19. Only using the initial guess, LiDAR depths is aligned incorrectly with the RGB images. For Park et al method, $E_y$ is so large that there exist horizontal drift between the LiDAR depth and the RGB image. Due to the small error of camera position, LiDAR depths computed via our method are nearly aligned with RGB images. From the visual calibration results on Figs. 17 and 19, it demonstrates that the proposed method can be used for the practical applications.

Fig. 19. Visual results of different methods in three test scenes.

Download Full Size | PDF

4.4 Discussions

From the simulations and dataset experiments, it is found that the accuracy of the proposed method is mainly dependent on the number of detected pseudo calibration objects. Inspired by the real data experiment, in the practical applications, we can extract many pseudo calibration objects from multi-frame LiDAR-camera data. One shortcoming of our method is that the proposed pseudo calibration object detection approach fails to work if the scene is complex enough. With the success of deep learning, we consider that learning based detection method, such as PointRCNN [42], can help the proposed object detection approach in some complex situations.

5. Conclusions

In this paper, we propose a novel extrinsic calibration method for LiDAR-camera system. It can work in the environment without calibration objects. Considering the real point correspondence is difficult to find in the natural scene, we propose the virtual point correspondence at the first time. It is the approximation of the real point correspondence. After that, we present a geometrical optimization scheme using virtual point correspondence, and then propose a novel extrinsic calibration method for LiDAR-camera system. It requires two calibration conditions, which can be easily satisfied in the practical application. Simulations, dataset experiments show that our method is robust and accurate. Real data experiments demonstrate that our method outperforms than state-of-the-art calibration object based method. It is verified that our method is applicable to the further advanced vision application.

Appendix

A. Convert a rotation matrix to a angle vector

Let $\mathbf {R}=(r_{ij})_{3\times 3}$ be a rotation matrix. It can be represented as the multiplication of three rotation matrix $\mathbf {R}(Z,\theta _z)\mathbf {R}(Y,\theta _y)\mathbf {R}(X,\theta _x)$. $\mathbf {R}(r,s)$ describes rotating $s$ angle around $r$ axis. The angle vector of $\mathbf {R}$ is $(\theta _x, \theta _y, \theta _z)^T$. Let $s=r_{11}^2 + r_{21}^2$. If $s > 0$, these angles are estimated from Eq. (18). If $s = 0$, these angles are estimated from Eq. (19).

(18)$$\theta_x = atan2(r_{32}, r_{33}), \theta_y = atan2(-r_{31}, s), \theta_z = atan2(r_{21}, r_{11})$$

(19)$$\theta_x = atan2(-r_{23}, r_{22}), \theta_y = atan2(-r_{31}, s), \theta_z = 0$$

Funding

National Natural Science Foundation of China (61991412, U1913602); Equipment Pre-Research Project (305050203, 41415020202, 41415020404).

Acknowledgments

The authors thank Siying Ke for providing many print suggestions, and thank Xianzhi Qi and Zhicheng Huang for collecting LiDAR-camera sensor data. The authors also appreciate anonymous reviewers for providing valuable and inspiring comments and suggestions.

Disclosures

The authors declare no conflicts of interest.

References

1. Z. Cai, X. Liu, X. Peng, and B. Z. Gao, “Ray calibration and phase mapping for structured-light-field 3D reconstruction,” Opt. Express 26(6), 7598–7613 (2018). [CrossRef]

2. Z. Cai, X. Liu, X. Peng, Y. Yin, A. Li, J. Wu, and B. Z. Gao, “Ray calibration and phase mapping for structured-light-field 3D reconstruction,” Opt. Express 24(18), 20324–20334 (2016). [CrossRef]

3. F. Abedi, Y. Yang, and Q. Liu, “Group geometric calibration and rectification for circular multi-camera imaging system,” Opt. Express 26(23), 30596–30613 (2018). [CrossRef]

4. L. Lilin, P. Zhiyong, and T. Dongdong, “Super multi-view three-dimensional display technique for portable devices,” Opt. Express 24(5), 4421–4430 (2016). [CrossRef]

5. Y. Cui, F. Zhou, Y. Wang, L. Liu, and H. Gao, “Precise calibration of binocular vision system used for vision measurement,” Opt. Express 22(8), 9134–9149 (2014). [CrossRef]

6. M. Wang, Y. Cheng, B. Yang, S. Jin, and H. Su, “On-orbit calibration approach for optical navigation camera in deep space exploration,” Opt. Express 24(5), 5536–5554 (2016). [CrossRef]

7. H. Di, H. Hua, Y. Cui, D. Hua, B. Li, and Y. Song, “Correction technology of a polarization lidar with a complex optical system,” J. Opt. Soc. Am. A 33(8), 1488–1494 (2016). [CrossRef]

8. W. Zhang, J. Zhao, M. Chen, Y. Chen, K. Yan, L. Li, J. Qi, X. Wang, J. Luo, and Q. Chu, “Registration of optical imagery and lidar data using an inherent geometrical constraint,” Opt. Express 23(6), 7694–7702 (2015). [CrossRef]

9. H. Di, H. Hua, Y. Cui, D. Hua, B. Li, and Y. Song, “Correction technology of a polarization lidar with a complex optical system,” J. Opt. Soc. Am. A 33(8), 1488–1494 (2016). [CrossRef]

10. C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), pp. 918–927.

11. J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3d proposal generation and object detection from view aggregation,” in Proceedings of International Conference on Intelligent Robots and Systems, (IEEE, 2018), pp. 1–8.

12. P. An, T. Ma, K. Yu, B. Fang, J. Zhang, W. Fu, and J. Ma, “Geometric calibration for lidar-camera system fusing 3d-2d and 3d-3d point correspondences,” Opt. Express 28(2), 2122–2141 (2020). [CrossRef]

13. Y. Park, S. Yun, C. Won, K. Cho, K. Um, and S. Sim, “Calibration between color camera and 3d lidar instruments with a polygonal planar board,” Sensors 14(3), 5333–5353 (2014). [CrossRef]

14. A. Dhall, K. Chelani, V. Radhakrishnan, and K. M. Krishna, “Lidar-camera calibration using 3d-3d point correspondences,” in arXiv:1705.09785, (2017), pp. 1–19.

15. Z. Pusztai and L. Hajder, “Accurate calibration of lidar-camera systems using ordinary boxes,” in Proceedings of IEEE International Conference on Computer Vision Workshops, (IEEE, 2017), pp. 394–402.

16. C. Guindel, J. Beltrán, D. Martín, and F. Garcia, “Automatic extrinsic calibration for lidar-stereo vehicle sensor setups,” (IEEE, 2017), pp. 1–6.

17. I. Ryoichi, O. Takeshi, and I. Katsushi, “Lidar and camera calibration using motion estimated by sensor fusion odometry,” in Proceedings of International Conference on Intelligence Robots and Systems, (IEEE, 2018), pp. 7342–7349.

18. T. Caselitz, B. Steder, M. Ruhnke, and W. Burgard, “Monocular camera localization in 3d lidar maps,” in Proceedings of International Conference on Intelligent Robots and Systems, (IEEE, 2016), pp. 1926–1931.

19. G. Pandey, J. R. Mcbride, S. Savarese, and R. M. Eustice, “Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information,” in Proceedings of the Twenty-Sixth Conference on Artificial Intelligence, (Academic, 2012), pp. 1–7.

20. R. W. Wolcott and R. M. Eustice, “Visual localization within lidar maps for automated urban driving,” in Proceedings of International Conference on Intelligent Robots and Systems (IEEE, 2014), pp. 176–183.

21. P. Neubert, S. Schubert, and P. Protzel, “Sampling-based methods for visual navigation in 3d maps by synthesizing depth images,” in Proceedings of International Conference on Intelligent Robots and Systems (IEEE, 2017), pp. 2492–2498.

22. N. Schneider, F. Piewak, C. Stiller, and U. Franke, “Regnet: Multimodal sensor registration using deep neural networks,” in Proceedings of Intelligent Vehicles Symposium, (IEEE, 2017), pp. 1803–1810.

23. G. Iyer, K. R. R. J. K. Murthy, and K. M. Krishna, “Calibnet: Self-supervised extrinsic calibration using 3d spatial transformer networks,” in Proceedings of International Conference on Intelligent Robots and Systems, (IEEE, 2018), pp. 1110–1117.

24. D. Sun, X. Yang, M. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), pp. 8934–8943.

25. V. Lepetit, F. M. Noguer, and P. Fua, “Epnp: An accurate o(n) solution to the pnp problem,” Int. J. Comput. Vis. 81(2), 155–166 (2009). [CrossRef]

26. B. Triggs, P. F. Mclauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment — a modern synthesis,” in Proceedings of Workshop on Vision Algorithms, (Academic, 2000), pp. 298–372.

27. Y. Ge, C. R. Maurer, and J. M. Fitzpatrick, “Surface-based 3-d image registration using the iterative closest point algorithm with a closest point transform,” Proc. SPIE 2710, 358–367 (1996). [CrossRef]

28. B. K. P. Horn, H. M. Hilden, and S. Negahdaripour, “Closed-form solution of absolute orientation using orthonormal matrices,” J. Opt. Soc. Am. A 5(7), 1127–1135 (1988). [CrossRef]

29. M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proceedings of 2nd International Conference on Learning Representations, (Academic, 2014), pp. 1–10.

30. D. Cattaneo, M. Vaghi, A. L. Ballardini, S. Fontana, D. G. Sorrenti, and W. Burgard, “Cmrnet: Camera to lidar-map registration,” in Proceedings of IEEE International Conference on Intelligent Transportation Systems, (IEEE, 2019), pp. 1283–1289.

31. Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Machine Intell. 22(11), 1330–1334 (2000). [CrossRef]

32. C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2018), pp. 918–927.

33. R. Harltey and A. Zisserman, Multiple view geometry in computer vision (2. ed) (Cambridge University Press, 2006).

34. J. More, “The levenberg-marquardt algorithm, implementation and theory,” Numerical Analysis 630, 105–116 (1977).

35. J. Ku, A. Harakeh, and S. L. Waslander, “In defense of classical image processing: Fast depth completion on the cpu,” in Proceedings of IEEE Conference on Computer and Robot Vision, (IEEE, 2018), pp. 16–22.

36. K. Zhang, S.-C. Chen, D. Whitman, M.-L. Shyu, J. Yan, and C. Zhang, “A progressive morphological filter for removing nonground measurements from airborne lidar data,” IEEE Trans. Geosci. Remote Sens. 41(4), 872–882 (2003). [CrossRef]

37. W. Yin, Y. Liu, C. Shen, and Y. Yan, “Enforcing geometric constraints of virtual normal for depth prediction,” in CoRR abs/1907.12209, (2019), pp. 1–14.

38. J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” in CoRR abs/1804.02767, (2018), pp. 1–6.

39. H. Ling, J. Gao, A. Kar, W. Chen, and S. Fidler, “Fast interactive object annotation with curve-gcn,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 5257–5266.

40. Y. I. Abdel-Aziz and H. M. Karara, “Direct linear transformation into object shape coordinates in close-range photogrammetry,” in Proceedings of the Symposium on Close-Range Photogrammetry (Academic, 1971), pp. 1–18.

41. A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2012), pp. 3354–3361.

42. S. Shi, X. Wang, and H. Li, “Pointrcnn: 3d object proposal generation and detection from point cloud,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2019), pp. 770–779.

Intrinsic Parameter	Value
$f_{x} / (p i x e l)$	480.0
$f_{y} / (p i x e l)$	420.0
$c_{x} / (p i x e l)$	260.0
$y_{y} / (p i x e l)$	240.0

Intrinsic Parameter	Value
$f_{x} / (p i x e l)$	718.856
$f_{y} / (p i x e l)$	718.856
$c_{x} / (p i x e l)$	607.193
$y_{y} / (p i x e l)$	185.216

	$N_{o b j}$ =4		$N_{o b j}$ =3		$N_{o b j}$ =2		$N_{o b j}$ =1
Method	$E_{r o t}$	$E_{t r a n s}$	$E_{r o t}$	$E_{t r a n s}$	$E_{r o t}$	$E_{t r a n s}$	$E_{r o t}$	$E_{t r a n s}$
EPnP [25]	1.039	0.118	1.264	0.166	2.643	0.356	6.184	0.921
BA [26]	0.979	0.089	1.170	0.141	2.475	0.343	5.002	0.868
DLT [40]	1.145	0.132	1.312	0.175	3.047	0.551	7.482	1.048
Our (Raw)	0.920	0.084	1.158	0.138	2.233	0.329	4.877	0.848
Our (Iterative)	0.845	0.079	1.154	0.133	2.076	0.311	4.601	0.827

Method	CMR-Net [30]	CalibNet [23]	Our Method
$E_{r o t}$	1.496	1.624	1.431
$E_{t r a n s}$	0.182	0.203	0.176

Intrinsic Parameter	Value
$f_{x} / (p i x e l)$	1094.04
$f_{y} / (p i x e l)$	1087.38
$c_{x} / (p i x e l)$	942.01
$y_{y} / (p i x e l)$	530.35

Method	Object usage	Number of corr.	$E_{x}$	$E_{y}$	$E_{z}$
Initial guess	-	0	7.04 cm	19.57 cm	4.58 cm
Park et al [13]	Objects 1 to 3	12	1.66 cm	9.05 cm	1.71 cm
Dhall et al [14]	Objects 1 to 3	12	1.43cm	8.34cm	1.59cm
Our Method	Objects 1 to 2	16	1.92cm	10.27cm	1.86cm
Our Method	Objects 1 to 4	32	1.25cm	6.88cm	1.14cm
Our Method	Objects 1 to 5	40	0.80 cm	5.42 cm	0.64 cm

Abstract

1. Introduction

2. Related works

2.1 Calibration methods

2.2 Discussion

3. Proposed method

3.1 Virtual point correspondence

3.2 Pose Estimation using virtual point correspondences

3.3 Extrinsic calibration using virtual point correspondences

3.3.1 Pseudo calibration object and calibration conditions

3.3.2 Overview of the proposed method

3.3.3 Pseudo calibration object detection from LiDAR depth

3.3.3.1 Depth completion and foreground segmentation

3.3.3.2 Normal guided foreground detection

3.3.4 Virtual point correspondences establishment

3.3.5 Iterative refinement

4. Experiments

4.1 Simulation

4.1.1 Experimental settings

4.1.2 Performance with respect to pixel noise

4.1.3 Performance with respect to object numbers

4.1.4 Performance with respect to optimization schemes

4.1.5 Performance with respect to different extrinsic parameters

4.2 Dataset experiments

4.2.1 Experimental setting

4.2.2 Performance with respect to pseudo calibration object detection from LiDAR depth

4.2.3 Performance with respect to calibration accuracy

4.2.4 Performance with respect to time consuming

4.3 Real data experiments

4.3.1 Experimental setting

4.3.2 Performance with respect to calibration accuracy

4.3.3 Visual results

4.4 Discussions

5. Conclusions

Appendix

A. Convert a rotation matrix to a angle vector

Funding

Acknowledgments

Disclosures

References

Cited By

Figures (19)

Tables (6)

Equations (19)

Optics Express