
Image-fusion-based object detection using a time-of-flight camera


Abstract

In this work, we demonstrate an object detection framework based on the fusion of depth and active infrared intensity images from a time-of-flight (ToF) camera. A sliding-window weight fusion (SWWF) method fuses the two modalities into a single image to localize targets. Then, the depth and intensity information is extracted to construct a joint feature space. Next, we utilize four machine learning methods to achieve object recognition. To verify this method, experiments are performed on an in-house dataset containing 1066 images, which are categorized into six different surface materials. The approach performs well on localization, with a 0.778 intersection over union (IoU). The best classification results are obtained with K-Nearest Neighbor (KNN), with a 98.01% total accuracy. Furthermore, the demonstrated method is less affected by varying illumination conditions.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Object detection has become a popular research field with the development of optical information processing and artificial intelligence (AI) in recent years. Since this technology enables target recognition, tracking, and status monitoring, it is now extensively used in industrial production [1,2], agriculture [3,4], the military [5-7], and transportation [8-10], among other fields. Images are one of the most important data types for object detection. Depending on the acquisition device, images can be classified into RGB images, grayscale images, thermal infrared images, etc. However, each kind of image has its advantages and disadvantages. Data from a single source were commonly used for object detection at the outset [11-15], which leads to obvious limitations. Therefore, an increasing number of approaches focus on information fusion to achieve object detection. As a reliable and powerful technique, image fusion is an enhancement method that integrates useful information into a new image with appropriate fusion principles [16]. A fused image not only contains more comprehensive information and provides complementary perception, but also supports more effective decisions in various conditions. Consequently, image fusion plays a vital role in image processing and machine vision.

Among the different fusion combinations, thermal infrared and visible (TIR-V) fusion performs well in many scenarios [16]. Thermal infrared images capture radiation features of the targets, while visible images typically reflect more details. The advantages of TIR-V fusion are manifested in many aspects, i.e., rich scene information and low-cost equipment. However, a TIR-V image pair is usually captured by different sensors, which causes potential problems for joint resolution. In addition, both image types are two-dimensional (2D) data, which might not be sufficient for target detection without depth information. Accordingly, depth and intensity (D-I) fusion, which introduces range information into the framework, has gained increasing attention recently. RGB and depth (RGB-D) fusion is one of the most popular D-I fusion schemes. Many models and strategies for RGB-D fusion have been developed in previous works and perform well in a range of real-world applications, such as semantic segmentation [17] and medical image segmentation [18]. Nevertheless, varying illumination conditions remain a challenge for RGB-D fusion.

In summary, existing fusion methods are still unable to achieve highly accurate object detection using a single sensor. To address these issues, this paper presents a salient object detection method based on the fusion of depth and intensity images from a time-of-flight (ToF) camera. Figure 1 provides an overview of the framework. Firstly, the depth and active infrared intensity (D-AII) images are captured simultaneously by a single ToF sensor. These two data streams are in principle independent of the natural lighting conditions [19], which ensures the robustness of the demonstrated method under various lighting conditions, including darkness, strong illumination, or arbitrary natural lighting. Several studies have already confirmed and exploited this characteristic of ToF cameras [20,21]. Note that most previous studies collected different images with multiple sensors, which implies diverse focal lengths, resolutions, and viewpoints. In contrast, a single sensor inherently avoids a complicated registration process. Secondly, the images from the two modalities are fused at the pixel level with the proposed algorithm to achieve foreground/background segmentation and localization. Finally, spatial and visual features are extracted to construct a joint feature space for object recognition. This method addresses a wide range of application requirements for target detection, such as unmanned express sorting, office monitoring, and autonomous driving.

Fig. 1. Framework overview and applications.

The major contributions of this paper can be summarized as follows:

  • (1) A single ToF camera is utilized to collect images, which reduces the sensor footprint and simplifies data processing.
  • (2) The D-AII fusion not only reflects visual and range information at the same high resolution, but is also robust to various environmental conditions.
  • (3) A multi-stage fusion is performed at both the pixel and feature levels, which maximizes the extraction and retention of complementary information.
  • (4) The proposed method demonstrates high localization precision and promising identification performance, with a 98.01% accuracy using the K-Nearest Neighbor (KNN) method.

The rest of this paper is organized as follows: Section 2 reviews related work on target detection with data fusion. Section 3 provides a detailed description of the developed approach. Section 4 describes the experimental setup and analyzes the results. Finally, conclusions and future work are discussed in Section 5.

2. Related works

Over the past few years, diverse data-fusion-based object detection methods have been developed for different scenarios. To review these approaches in detail, we consider them from the following perspectives: (1) the fusion type, TIR-V or D-I, depending on whether 3D information is included; (2) the fusion level, including pixel level, feature level, decision level, and others [22]; and (3) the feature extraction strategy, using either traditional methods or deep neural network (DNN) models. In this section, we discuss TIR-V based and D-I based methods respectively and analyze their specific techniques, such as fusion levels and feature extraction strategies.

2.1 TIR-V based methods

Traditional detection methods usually rely on manually extracted image features. Schnelle et al. [23] presented a target detection and tracking method based on pixel-level infrared-visible fusion. The method relied on the Force Protection Surveillance System (FPSS) tracker and on finding the brightest pixel in the difference-product image (DPI). Fendri et al. [24] proposed a TIR-V fusion method for robust moving object detection, which mainly applies inter-frame differences in the background model of the low-level fused image.

There are also works that use convolutional neural network (CNN) methods in the TIR-V fusion and detection process. Castillo et al. [25] presented a people detection system based on infrared and visible video fusion and the INT3-Horus framework. Yao et al. [26] proposed a detection model named IVF-Mask R-CNN, based on an improved Mask R-CNN [27,28], which matches and selects the outputs of the sub-models of the two data streams at the decision level. Xiao et al. [29] proposed a CNN-based infrared and visible image object detection network, whose highlight is a maximum-difference loss function that maximizes the difference between the features from the two base CNNs and extracts complementary and diverse features.

2.2 D-I based methods

Rapus et al. [30] presented a system for pedestrian recognition using low-resolution 3D-camera based depth and intensity measurements. This method utilized an AdaBoost head-shoulder detector to generate hypotheses about possible pedestrian positions, and further classified every hypothesis as pedestrian or not with AdaBoost or a support vector machine (SVM). Enzweiler et al. [31] proposed a multi-level Mixture-of-Experts framework for pedestrian classification. This method extracted complementary features from intensity, dense depth, and dense flow data, and decomposed the complex classification problem into manageable sub-problems. Makris et al. [32] proposed a part-based object recognition framework in which the D-I fusion is performed at the local feature level, increasing robustness by narrowing the search over the possible detection scales. Beleznai et al. [33] presented a left-luggage detection framework in which the proposals provided by intensity and depth cues are combined and validated to produce the final candidates. Keller et al. [34] presented a pedestrian detection system based on dense depth and intensity images. This method extracted traditional spatial features and employed D-I fusion directly at the classifier level. Additionally, there are numerous target detection or tracking frameworks based on classic neural networks, e.g., VGG-16 [35,36] and ResNet [37,38]. These methods mainly use RGB-D saliency detection networks.

Nevertheless, high-accuracy object detection with a single sensor remains a challenge. In this paper, we propose a D-AII image-fusion-based object detection framework with a single ToF device.

3. Method

Figure 2 illustrates the schematic diagram of the proposed method. In this research, we focus on object detection using an image-fusion method with a ToF depth camera. Six objects with different surface materials are selected to test the proposed method (as Fig. 3 shows): aluminum foil sticker (AFS), high-reflectivity mesh sticker (HMS), high-reflectivity red sticker (HRS), deep black paperboard (DBP), light blue paperboard (LBP), and light brown cardboard (LBC). The ToF camera is used to collect the depth and intensity images. It should be noted that both the depth and intensity images obtained by the camera are grayscale images. Because the image pairs are captured through the same lens, registration can be omitted in the fusion process. The acquired images are sent into the image processing and object detection framework, which mainly contains three parts. Firstly, the depth and intensity images are fused with sliding-window weight fusion (SWWF) after data pre-processing. Next, optical features are extracted to establish a feature database. Finally, object detection is achieved with the feature database and machine learning methods.

Fig. 2. Image fusion algorithm framework. SWWF: sliding-window weight fusion.

Fig. 3. Six materials in the experiment. AFS: aluminum foil sticker; HMS: high-reflectivity mesh sticker; HRS: high-reflectivity red sticker; DBP: deep black paperboard; LBP: light blue paperboard; LBC: light brown cardboard.

3.1 Image pre-processing

Objects are often placed in complex environments in real conditions. In order to simplify the detection process, the background information is suppressed to highlight the foreground. The background image can be obtained in several ways, such as direct capture or splicing frames from different moments. Moreover, we want the foreground details to be enhanced so that the segmentation accuracy can be further improved. As Fig. 4 shows, an iterative background suppression method is proposed. The background image model needs to be predefined or determined in advance using background capture techniques, based on the specific scene in which the objects are located. Firstly, an attenuation coefficient is assigned to the background image, and the attenuated background image is subtracted from the scene image. Then, the details are extracted from the differential image with an adaptive threshold approach and used to enhance the differential image. The whole process can be expressed by

$${I_{diff}} = {I_{S}}-d \times {I_{b}}$$
$${I_{dtl}} = {F_{at}}({{I_{diff}}} )$$
$${I_{en}} = {k_{diff}} \times {I_{diff}} + {k_{dtl}} \times {I_{dtl}}$$
where Is is the scene image, Idiff is the differential image, Idtl is the detail image, Ien is the enhanced image, Fat is the adaptive threshold function, d is the attenuation coefficient, and kdiff and kdtl are the scale factors of the differential and detail images, respectively. After obtaining the first enhanced image, we stretch its grayscale and take it as the new scene image to continue the enhancement. The attenuation coefficient and the fusion weights are updated in this process. The iteration is then executed to increase the difference between the foreground and background gray levels. Moreover, the accuracy of the subsequent segmentation is guaranteed because the detailed information is integrated. Note that the background image cannot simply be subtracted from the scene image directly, as this introduces more noise and incorrect attenuation.
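For concreteness, the following minimal Python sketch illustrates one pass through Eqs. (1)-(3). The OpenCV adaptive threshold standing in for Fat, the iteration count, and the update rule for the attenuation coefficient are illustrative assumptions, not the authors' tuned settings.

import cv2
import numpy as np

def suppress_background(scene, background, n_iter=3, d=0.9, k_diff=0.8, k_dtl=0.2):
    # `scene` and `background` are 8-bit grayscale images of equal size.
    img = scene.astype(np.float32)
    bg = background.astype(np.float32)
    for _ in range(n_iter):
        # Eq. (1): subtract the attenuated background from the scene image.
        diff = np.clip(img - d * bg, 0, 255)
        # Eq. (2): extract details with an adaptive threshold (stand-in for Fat).
        dtl = cv2.adaptiveThreshold(diff.astype(np.uint8), 255,
                                    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 11, 2).astype(np.float32)
        # Eq. (3): weighted combination of the differential and detail images.
        en = k_diff * diff + k_dtl * dtl
        # Grayscale stretch, then feed back as the new scene image.
        img = cv2.normalize(en, None, 0, 255, cv2.NORM_MINMAX)
        # Example update of the attenuation coefficient (assumed rule).
        d *= 0.9
    return img.astype(np.uint8)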

Fig. 4. Flow chart of image pre-processing.

3.2 Sliding-window weight fusion (SWWF)

The SWWF framework is designed to fuse the pre-processed depth and intensity images. Assume both are m × n pixels in size. First, the size of the sliding window is set to w × w, and the window slides with a step of w pixels over the depth and intensity images. Then the mean grayscale value of the pixels in each window block is calculated, yielding two (m/w) × (n/w) matrices, named decision matrices (DMs). Finally, the DMs are sent into the comparator to perform the image fusion. The specific algorithm is described in Fig. 5 and Algorithm 1. By using SWWF, the foreground objects are further highlighted, while the detailed information is preserved as much as possible.

Fig. 5. Flow chart of fusing, thresholding and locating. DM: decision matrix; CA: comparator algorithm; TS: thresholding.

Algorithm 1: Pseudo code of the comparator algorithm
Input: DI: depth image; II: intensity image; DM_D: decision matrix of DI; DM_I: decision matrix of II; w: window size; a = m/w; b = n/w.
Output: FI: fused image.
for i = 0 to a - 1
  for j = 0 to b - 1
    if DM_D(i, j) < 14 and DM_I(i, j) < 14
      FI(i*w : (i+1)*w, j*w : (j+1)*w) =
          min(DI(i*w : (i+1)*w, j*w : (j+1)*w),
              II(i*w : (i+1)*w, j*w : (j+1)*w))
    else if DM_D(i, j) > 14 and DM_I(i, j) > 14
      FI(i*w : (i+1)*w, j*w : (j+1)*w) =
          max(DI(i*w : (i+1)*w, j*w : (j+1)*w),
              II(i*w : (i+1)*w, j*w : (j+1)*w))
    else
      FI(i*w : (i+1)*w, j*w : (j+1)*w) =
          0.5*DI(i*w : (i+1)*w, j*w : (j+1)*w) +
          0.5*II(i*w : (i+1)*w, j*w : (j+1)*w)
    end if
  end for
end for
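A compact NumPy rendering of Algorithm 1 is given below as a sketch; the window size w = 8 is an illustrative assumption, and the image dimensions are assumed to be multiples of w.

import numpy as np

def swwf_fuse(depth, intensity, w=8, thresh=14):
    # `depth` and `intensity` are pre-processed grayscale images (uint8).
    m, n = depth.shape
    a, b = m // w, n // w
    fused = np.zeros_like(depth, dtype=np.float32)
    # Decision matrices: mean grayscale of each w x w block.
    dm_d = depth.reshape(a, w, b, w).mean(axis=(1, 3))
    dm_i = intensity.reshape(a, w, b, w).mean(axis=(1, 3))
    for i in range(a):
        for j in range(b):
            rows = slice(i * w, (i + 1) * w)
            cols = slice(j * w, (j + 1) * w)
            d_blk = depth[rows, cols].astype(np.float32)
            i_blk = intensity[rows, cols].astype(np.float32)
            if dm_d[i, j] < thresh and dm_i[i, j] < thresh:
                fused[rows, cols] = np.minimum(d_blk, i_blk)   # dark block: keep minimum
            elif dm_d[i, j] > thresh and dm_i[i, j] > thresh:
                fused[rows, cols] = np.maximum(d_blk, i_blk)   # bright block: keep maximum
            else:
                fused[rows, cols] = 0.5 * d_blk + 0.5 * i_blk  # ambiguous: average
    return fused.astype(np.uint8)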

3.3 Thresholding and clustering

The grayscale gap between the foreground and the background has been enlarged by the previous image processing, which makes it convenient to remove the background with a threshold method. As shown in Fig. 5, foreground/background segmentation can be achieved by setting the threshold manually or adaptively. The images are further denoised to remove outliers and holes. The foreground pixels are then clustered into several groups, which correspond to different objects. In this way, we can calculate the 2D positions of the objects in the camera coordinate system, and further locate their 3D world coordinates using the depth information and the camera model.
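The sketch below illustrates this segmentation-and-localization step under stated assumptions: Otsu thresholding, morphological denoising, and connected-component clustering stand in for the manual/adaptive threshold and clustering described above, and the pinhole intrinsics (fx, fy, cx, cy) are assumed to come from camera calibration.

import cv2
import numpy as np

def locate_objects(fused, depth, fx, fy, cx, cy, thresh=None):
    # `fused` is the SWWF output (uint8); `depth` gives per-pixel range in meters.
    if thresh is None:
        # Adaptive global threshold (Otsu) to split foreground from background.
        _, mask = cv2.threshold(fused, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    else:
        _, mask = cv2.threshold(fused, thresh, 255, cv2.THRESH_BINARY)
    # Remove outliers and fill small holes.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Cluster foreground pixels into groups (one per object) and locate them.
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    objects = []
    for k in range(1, num):                                # label 0 is the background
        x, y, bw, bh, area = stats[k]
        u, v = centroids[k]
        z = float(np.median(depth[labels == k]))           # object range from the depth cue
        X, Y = (u - cx) * z / fx, (v - cy) * z / fy         # back-project to the camera frame
        objects.append({"bbox": (x, y, bw, bh), "xyz": (X, Y, z)})
    return objects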

3.4 Feature extraction

Objects with different colors or materials exhibit wavelength selectivity when reflecting or absorbing light at their surfaces. Furthermore, different surface characteristics, such as texture and roughness, lead to different reflection effects. Consequently, one can utilize these optical features to identify different object types, which avoids the training of deep neural networks and the associated computational cost. In this work, we extract and synthesize four comprehensive and complementary features through the fusion analysis of depth and intensity data: depth (DP), mean intensity value (MV), intensity standard deviation (SD), and intensity gradient (GD). By utilizing MV, SD, and GD to characterize the intensity information, we aim to fully exploit the richness and effectiveness of the intensity data, ensuring comprehensive information retrieval.

  • 1) Depth

    Depth, representing the distance from the object to the camera, not only accurately conveys spatial location but also exerts an influence on the reflection intensity. This influence is attributed to atmospheric scattering or absorption [39,40], causing reduced light reflection received by the camera as objects move farther away. Hence, we can adopt the depth information as a vital feature, which can be collected by the depth camera directly.

  • 2) Mean intensity value and standard deviation

    Intensity is a pivotal physical quantity for describing the reflection. It serves not only to accentuate differences between targets through distinctive and feature-rich reflectance effect, but also to alleviate the adverse effects of ambient lighting. The mean value (MV) and standard deviation (SD) are widely employed statistics in this context. In our approach, we calculate the MV and SD for each pixel cluster to gauge the average reflection ability and homogeneity of various surfaces. This enables a comprehensive assessment of the surface characteristics based on their reflective properties.

  • 3) Intensity gradient

    Gradient serves as a robust representation of surface texture and intensity fluctuations. Several gradient feature extraction algorithms, such as SIFT [41] and HoG [42], are widely utilized in computer vision. As depicted in Fig. 6, the gradient extraction is accomplished through the following steps. First, the Canny operator [43] is used to detect the object edges and obtain the gradient images (GIs). Next, the central coordinates Cm (x0, y0) of each pixel group are approximately computed by

    $${x_0} = \frac{{SUM({{x_i}} )}}{N}$$
    $${y_0} = \frac{{SUM({{y_i}} )}}{N}$$
    where N is the number of pixels in each cluster and xi, yi are the pixel coordinates. Then, a box 20 pixels wide and 30 pixels high is centered at Cm. Finally, the mean pixel value of the GI within this region is calculated and taken as the gradient indicator.

Fig. 6. An example of intensity gradient extraction.

Thus, the joint feature space, consisting of the above four features extracted from the training set, provides a feature database, which is the basis of the subsequent object detection (a minimal sketch of the feature extraction is given below).
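The following sketch computes the four features (DP, MV, SD, GD) for one pixel cluster produced by the localization step; the median-based depth estimate and the Canny thresholds are illustrative assumptions.

import cv2
import numpy as np

def extract_features(intensity, depth, labels, label_id, box=(20, 30)):
    # `labels` is the cluster map from the localization step; `label_id` selects one object.
    mask = labels == label_id
    ys, xs = np.nonzero(mask)
    # DP: depth of the cluster (median used here as a robust estimate).
    dp = float(np.median(depth[mask]))
    # MV / SD: mean and standard deviation of the reflected intensity.
    mv = float(intensity[mask].mean())
    sd = float(intensity[mask].std())
    # GD: Canny gradient image, averaged in a 20 x 30 box centered on the cluster.
    gi = cv2.Canny(intensity, 50, 150)
    x0, y0 = int(xs.mean()), int(ys.mean())          # Eqs. (4)-(5): cluster center
    half_w, half_h = box[0] // 2, box[1] // 2
    patch = gi[max(y0 - half_h, 0):y0 + half_h, max(x0 - half_w, 0):x0 + half_w]
    gd = float(patch.mean()) if patch.size else 0.0
    return np.array([dp, mv, sd, gd], dtype=np.float32)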

3.5 Object detection

In this section, we use four machine learning (ML) methods, namely decision tree (DT), KNN, SVM, and artificial neural network (ANN), to identify the object categories with the feature database. As shown in Fig. 7, the four features are taken as the inputs to train the different ML techniques, and the outputs are the classification results. DT, KNN, and SVM are traditional machine learning methods. The ANN consists of three fully connected (FC) layers and uses sigmoid and softmax as nonlinear activation functions. Note that the input layer has 12 neurons; the input dimension is tripled to optimize the network structure. Backpropagation and stochastic gradient descent (SGD) are used to train the network. Thus, the object detection task is completed through localization and classification [44].
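As an illustration, the snippet below trains the best-performing classifier (KNN) on the joint feature vectors with scikit-learn; the feature scaling and K = 5 are illustrative stand-ins for the distance-scale adjustment and neighbor number described in Section 4.3, and DT, SVM, or an MLP could be substituted in the same way.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_knn(features_train, labels_train, k=5):
    # Scale the [DP, MV, SD, GD] features so that Euclidean distances are
    # comparable, then fit a KNN classifier (K = 5 is an assumed value).
    clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    clf.fit(features_train, labels_train)
    return clf

# Usage: clf = train_knn(X_train, y_train); predictions = clf.predict(X_test)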

Fig. 7. ML methods for identification and the ANN structure.

4. Experimental setup and results

4.1 System building and data preparation

As shown in Fig. 8, an image collecting system is built to verify the proposed method. The working wavelength is 860 nm. The resolution of the ToF camera is set to 640 × 480 pixels, and the working distance ranges from 0.3 m to 2.4 m. The depth and infrared intensity images are captured simultaneously with the viewer software or a dedicated program to acquire accurately matched image pairs in time series. When creating the dataset, objects with different materials are placed at different positions and distances from the camera. Each frame contains 2 to 5 objects of the same or different types. Finally, we generate a dataset containing 1066 images of objects with six different materials. All data are randomly divided into training and testing sets with a 4:1 ratio.
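A minimal sketch of the 4:1 random split is shown below; the array names `features` and `labels` are hypothetical placeholders for the assembled feature vectors and material classes.

from sklearn.model_selection import train_test_split

# `features` is an (N, 4) array of [DP, MV, SD, GD] vectors and `labels`
# the corresponding material classes; a 4:1 split corresponds to test_size = 0.2.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)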

Fig. 8. (a) Concept and (b) experimental setup of the image collecting system. PC: personal computer; ToF: Time-of-Flight depth camera.

4.2 Foreground/background segmentation and localization

Figure 9 shows the foreground/background segmentation results after pre-processing and fusing. These pictures are cropped from complete data frames. The first and second columns are the original depth and intensity images. The segmentation results based on the intensity images are shown in the third column. Figure 10 displays the further clustering results. The first and second columns are the depth and intensity images. The third column is the clustering and localization results. The pixel groups are clearly divided into different classes. In conclusion, the objects can be located in the images by their bounding boxes. The green boxes represent the ground truth, and the red boxes describe the predicted segmentations. Here, the intersection over union (IoU) is adopted to evaluate the localization performance. If the IoU surpasses 50%, the object is considered to be well localized [44,45]. We randomly sample 10% of the data from the testing set to calculate the IoU, and the average result is 0.778. These results show that the proposed method performs well in segmentation and localization.
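For reference, the IoU between a predicted and a ground-truth bounding box can be computed as in the short sketch below; the boxes are given as corner coordinates and the numbers in the usage comment are illustrative.

def iou(box_a, box_b):
    # Boxes are (x_min, y_min, x_max, y_max) tuples in pixel coordinates.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Example: iou((10, 10, 50, 70), (15, 12, 55, 75)) ≈ 0.70, i.e. above the 0.5 criterion.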

Fig. 9. Foreground/background segmentation for the objects with six different materials. Row 1: Depth images; Row 2: Intensity images; Row 3: Segmentation results.

Fig. 10. Results of clustering and localization.

4.3 Feature extraction and identification

In this section, we extract the above four features from all images and establish the feature database from the training set. Figure 11 illustrates the feature distribution in a vector space (GD is not included due to the dimension limitation) for the six objects, and Fig. 12 shows the mean-normalized MV, SD, and GD at different depths. It is observed that the MV of LBC, LBP, and DBP are at a relatively low level, and they are approximately negatively correlated with the depth value. This phenomenon confirms the relationship between reflection intensity and depth mentioned earlier. In addition, the SD and GD of LBC, LBP, and DBP are much smaller than those of the other three materials, indicating that their surfaces have less texture. The HRS surface has a high reflectivity and thus the highest MV. Moreover, compared with the other five materials, HMS obtains the highest GD because of its mesh texture. As for AFS, the reflection is not homogeneous due to its metal surface, which leads to stochastic results when its position is randomly changed. Therefore, the distribution of the points belonging to AFS is more discrete in our training set.

Fig. 11. Distance, mean value, and standard deviation distributions in a vector space of the training set.

Fig. 12. Normalized mean value, standard deviation, and gradient features of the six materials of the training set.

Ultimately, we utilize these features to train the machine learning modules and evaluate our proposed method on the testing set. Since KNN and SVM are more sensitive to the Euclidean distance, the coordinate scales are adjusted during training. Figure 13 shows the total accuracy of DT, KNN, SVM, and ANN when testing on the six materials. The accuracy measures the ratio between correct predictions and total observations, which is calculated by

$$Accuracy = \frac{{TP + TN}}{{All}}$$
where TP denotes true positives (positive samples predicted as positive) and TN denotes true negatives (negative samples predicted as negative). One can see that all four ML methods achieve >90% accuracy, and KNN performs best. DT and ANN perform slightly better than SVM, but still worse than KNN. Furthermore, precision is usually used to assess the performance of ML techniques for specific categories. The precision is calculated by
$$Precision = \frac{{TP}}{{TP + FP}}$$
where FP denotes false positives (negative samples predicted as positive). Table 1 lists the precision of the four ML methods on each material. Generally, the KNN method provides the best identification results for the six materials, with 100% prediction precision on AFS, HRS, DBP, and LBP. The reason is that we select an appropriate Euclidean distance scale and a suitable number of nearest neighbors K, which are crucial parameters for KNN. DT has relatively good average precision although it achieves 100% correct predictions on only two materials. SVM seems to perform better on AFS, HMS, and HRS, which is related to the principle of the algorithm. Indeed, the most important task for SVM is to find a hyperplane that maximizes the margin [46,47]. By adding the gradient to the features of Fig. 11, AFS, HMS, and HRS can be linearly separated in the four-dimensional space. For the ANN, the features of LBC are too similar to those of DBP and LBP, which may explain its poor detection precision on LBC.
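Both metrics can be computed directly from the predicted and ground-truth labels, as in the following sketch of Eqs. (6) and (7).

import numpy as np

def accuracy_and_precision(y_true, y_pred, classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Total accuracy, Eq. (6): (TP + TN) / All.
    accuracy = np.mean(y_true == y_pred)
    # Per-class precision, Eq. (7): TP / (TP + FP).
    precision = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        precision[c] = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return accuracy, precision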

Fig. 13. Total accuracy histograms of identification using different ML methods.


Table 1. Precisions of Detection Using Different ML Methods

An interesting phenomenon in our experiment is that the more complex ML methods, such as SVM and ANN, do not perform better on identification. Conversely, a traditional simple technique like KNN achieves >98% total accuracy. The reason is that the simple ML methods reduce the computational cost and are well suited to the low-dimensional joint features. In contrast, an ANN usually requires a larger number of more complex features. When the features are less complex, the difference between an ANN and traditional ML methods is not pronounced, and it is difficult to find a satisfactory structure that trains the neurons well and balances underfitting and overfitting [48].

5. Conclusion

We have demonstrated an object detection solution based on a ToF camera and depth-intensity image fusion. In this work, the image processing methods, especially the SWWF, are developed to enhance the data and prepare for the subsequent steps. In addition, feature extraction and lightweight machine learning methods are utilized to identify the objects. Finally, we built a dataset containing six categories of objects located at different positions to verify our method. As a ToF device is employed, our proposed approach can perform identification under varying or dark illumination conditions. Compared with deep convolutional neural networks, this method reduces the computational cost. The experimental results indicate that the proposed method detects objects effectively, especially with the KNN technique, which achieves >98% total accuracy. This way of utilizing the two data streams captured by a single ToF sensor could potentially be applied to industrial and agricultural production under different illumination conditions, helping mitigate the degradation from ambient lighting that affects passive vision techniques.

Nevertheless, some limitations need to be discussed. Firstly, the experiments are carried out on objects of six materials, so the generalization ability needs to be further verified; the types of objects will be expanded in future work. Besides, object occlusion is not considered in this proof-of-concept demonstration. In addition, complex or undesirable object placement and surface conditions, such as largely tilted orientations or inaccurate depth values caused by the near-specular reflection of highly reflective materials, may also reduce the recognition accuracy of the present method. In future work, these problems will be addressed with image processing algorithms and multi-location monitoring.

Funding

Shaanxi Province Innovation Talent Promotion Program-Science and Technology Innovation Team (2023-CX-TD-03).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. F. Saeed, M. J. Ahmed, M. J. Gul, et al., “A robust approach for industrial small-object detection using an improved faster regional convolutional neural network,” Sci. Rep. 11(1), 23390 (2021). [CrossRef]  

2. X. Du, M. Yu, Z. Ye, et al., “A passive target recognition method based on LED lighting for industrial internet of things,” IEEE Photonics J. 13(4), 1–8 (2021). [CrossRef]  

3. L. Butera, A. Ferrante, M. Jermini, et al., “Precise agriculture: Effective deep learning strategies to detect pest insects,” IEEE/CAA Journal of Automatica Sinica 9(2), 246–258 (2022). [CrossRef]  

4. Z. Zhao, Y. Liu, X. Sun, et al., “Composited FishNet: Fish detection and species recognition from low-quality underwater videos,” IEEE Trans. on Image Process. 30, 4719–4734 (2021). [CrossRef]  

5. B. Hou, Z. Ren, W. Zhao, et al., “Object detection in high-resolution panchromatic images using deep models and spatial template matching,” IEEE Trans. Geosci. Remote Sensing 58(2), 956–970 (2020). [CrossRef]  

6. B. Janakiramaiah, G. Kalyani, A. Karuna, et al., “Military object detection in defense using multi-level capsule networks,” Soft. Comput. 27(2), 1045–1059 (2023). [CrossRef]  

7. L. Collins, G. Ping, and L. Carin, “An improved Bayesian decision theoretic approach for land mine detection,” IEEE Trans. Geosci. Remote Sensing 37(2), 811–819 (1999). [CrossRef]  

8. C. Chen, B. Liu, S. Wan, et al., “An edge traffic flow detection scheme based on deep learning in an intelligent transportation system,” IEEE Trans. Intell. Transp. Syst. 22(3), 1840–1852 (2021). [CrossRef]  

9. X. Yuan, X. Hao, H. Chen, et al., “Robust traffic sign recognition based on color global and local oriented edge magnitude patterns,” IEEE Trans. Intell. Transp. Syst. 15(4), 1466–1477 (2014). [CrossRef]  

10. A. Shakeri, B. Moshiri, and H. G. Garakani, “Pedestrian detection using image fusion and stereo vision in autonomous vehicles,” in 2018 9th International Symposium on Telecommunications (IST 2018) (2018), pp. 592–596.

11. Paul Viola and Michael Jones, “Rapid object detection using a boosted cascade of simple features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2001), pp. I. [CrossRef]  

12. R. Girshick, J. Donahue, T. Darrell, et al., “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014), pp. 580–587.

13. M. H. Yang, D. J. Kriegman, and N. Ahuja, “Detecting faces in images: A survey,” IEEE Trans. Pattern. Anal. 24(1), 34–58 (2002). [CrossRef]  

14. J. Redmon, S. Divvala, R. Girshick, et al., “You only look once: Unified, real-time object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 779–788.

15. D. Becker and S. Cain, “Improved space object detection using short-exposure image data with daylight background,” Appl. Opt. 57(14), 3968–3975 (2018). [CrossRef]  

16. J. Ma, Y. Ma, and C. Li, “Infrared and visible image fusion methods and applications: A survey,” Info. Fus. 45, 153–178 (2019). [CrossRef]  

17. Z. Yu, Y. Zhuge, H. Lu, et al., “Joint learning of saliency detection and weakly supervised semantic segmentation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019), pp. 7223.

18. D. P. Fan, T. Zhou, G. Ji, et al., “Inf-net: Automatic covid-19 lung infection segmentation from ct images,” IEEE Trans. Med. Imaging 39(8), 2626–2637 (2020). [CrossRef]  

19. S. Su, F. Heide, R. Swanson, et al., “Material classification using raw time-of-flight measurements,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 3503–3511.

20. C. Bamji, J. Godbaz, M. Oh, et al., “A review of indirect time-of-flight technologies,” IEEE Trans. Electron Devices 69(6), 2779–2793 (2022). [CrossRef]  

21. F. Heide, L. Xiao, W. Heidrich, et al., “ Diffuse mirrors: 3D reconstruction from diffuse indirect illumination using inexpensive time-of-flight sensors,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014), pp. 3222–3229.

22. S. Li, X. Kang, L. Fang, et al., “Pixel-level image fusion: A survey of the state of the art,” Info. Fus. 33, 100–112 (2017). [CrossRef]  

23. S. R. Schnelle and A. L. Chan, “Enhanced target tracking through infrared-visible image fusion,” in 14th International Conference on Information Fusion (FUSION) (2011), pp. 1–8.

24. E. Fendri, R. R. Boukhriss, and M. Hammami, “Fusion of thermal infrared and visible spectra for robust moving object detection,” Pattern Anal. Appl. 20(4), 907–926 (2017). [CrossRef]  

25. J. C. Castillo, A. Fernández-Caballero, J. Serrano-Cuerda, et al., “Smart environment architecture for robust people detection by infrared and visible video fusion,” J. Ambient Intell. Humaniz. Comput. 8(2), 223–237 (2017). [CrossRef]  

26. J. Yao, Y. Zhang, F. Liu, et al., “Object Detection Based On Decision Level Fusion,” in 2019 Chinese Automation Congress (CAC 2019) (2019), pp. 3257–3262.

27. R. Girshick, “Fast R-CNN,” in 2015 IEEE International Conference on Computer Vision (ICCV) (2015), pp. 1440–1448.

28. S. Q. Ren, K. M. He, R. Girshick, et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Advance in Neural Information Processing Systems 28 (NIPS 2015) (2015).

29. X. Xiao, B. Wang, L. Miao, et al., “Infrared and Visible Image Object Detection via Focused Feature Enhancement and Cascaded Semantic Extension,” Remote Sens. 13(13), 2538–2556 (2021). [CrossRef]  

30. M. Rapus, S. Munder, G. Baratoff, et al., “Pedestrian recognition using combined low-resolution depth and intensity images,” in 2008 IEEE Intelligent Vehicles Symposium (IV) (2008), pp. 632–636.

31. M. Enzweiler and D. M. Gavrila, “A multilevel Mixture-of-Experts framework for pedestrian classification,” IEEE Trans. on Image Process. 20(10), 2967–2979 (2011). [CrossRef]  

32. A. Makris, M. Perrollaz, and C. Laugier, “Probabilistic Integration of Intensity and Depth Information for Part-Based Vehicle Detection,” IEEE Trans. Intell. Transp. Syst. 14(4), 1896–1906 (2013). [CrossRef]  

33. C. Beleznai, P. Gemeiner, and C. Zinner, “Reliable Left Luggage Detection Using Stereo Depth and Intensity Cues,” in IEEE International Conference on Computer Vision Workshops (ICCVW) (2013), pp. 59–66.

34. C. G. Keller, M. Enzweiler, M. Rohrbach, et al., “The Benefits of Dense Stereo for Pedestrian Detection,” IEEE Trans. Intell. Transp. Syst. 12(4), 1096–1106 (2011). [CrossRef]  

35. J. Zhang, D. P. Fan, Y. Dai, et al., “UC-Net: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders,” in 2020 Conference on Computer Vision and Pattern Recognition (CVPR) (2020), pp. 8579–8588.

36. Y. Piao, Z. Rong, M. Zhang, et al., “A2dele: Adaptive and Attentive Depth Distiller for Efficient RGB-D Salient Object Detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020), pp. 9057–9066.

37. J. Zhao, Y. Zhao, J. Li, et al., “Is depth really necessary for salient object detection?” in 28th ACM International Conference on Multimedia (ACM-MM) (2020), pp. 1745–1754.

38. Z. Huang, H.-X. Chen, T. Zhou, et al., “Multi-level cross-modal interaction network for RGB-D salient object detection,” Neurocomputing 452, 200–211 (2021). [CrossRef]  

39. H. Lu, Y. Zhang, Y. Li, et al., “Depth Map Reconstruction for Underwater Kinect Camera Using Inpainting and Local Image Mode Filtering,” IEEE Access 5, 7115–7122 (2017). [CrossRef]  

40. T. Muraji, K. Tanaka, T. Funatomi, et al., “Depth from phasor distortions in fog,” Opt. Express 27(13), 18858–18868 (2019). [CrossRef]  

41. D. G. Lowe, “Object recognition from local scale-invariant features,” in IEEE Conference on Computer Vision (ICCV) (1999), pp. 1150–1157.

42. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2005), pp. 886–893.

43. J. Canny, “A Computational Approach to Edge Detection,” IEEE Trans. Pattern Anal. 8(6), 679–698 (1986). [CrossRef]  

44. G. Mora-Martin, A. Turpin, A. Ruget, et al., “High-speed object detection with a single-photon time-of-flight image sensor,” Opt. Express 29(21), 33184–33196 (2021). [CrossRef]  

45. Z. W. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into High Quality Object Detection,” in IEEE Conference on Computer vision and Pattern Recognition (CVPR) (2018), pp. 6154–6162.

46. C. J. C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery 2(2), 121–167 (1998). [CrossRef]  

47. X. Wu, W. Zuo, L. Lin, et al., “F-SVM: Combination of Feature Transformation and SVM Learning via Convex Relaxation,” IEEE Trans. Neural Netw. Learn. Syst. 29(11), 5185–5199 (2018). [CrossRef]  

48. Y. Zhang, Y. Ren, G. Xie, et al., “Detecting Object Open Angle and Direction Using Machine Learning,” IEEE Access 8, 12300–12306 (2020). [CrossRef]  
