
3D object detection through fog and occlusion: passive integral imaging vs active (LiDAR) sensing

Open Access

Abstract

In this paper, we address the problem of object recognition in degraded environments including fog and partial occlusion. Both long wave infrared (LWIR) imaging systems and LiDAR (time-of-flight) imaging systems using the Azure Kinect, which combines conventional visible and lidar sensing information, have previously been demonstrated for object recognition under ideal conditions. However, the object detection performance of Azure Kinect depth imaging systems may decrease significantly in adverse weather conditions such as fog, rain, and snow. The concentration of fog degrades the depth images of the Azure Kinect camera and the overall visibility of the RGBD images (fused RGB and depth images), which can make object recognition tasks challenging. LWIR imaging may avoid these issues of lidar-based imaging systems. However, due to the poor spatial resolution of LWIR cameras, thermal imaging provides limited textural information within a scene and hence may fail to provide adequate discriminatory information to distinguish between objects of similar texture, shape, and size. To improve the object detection task in fog and occlusion, we use a three-dimensional (3D) integral imaging (InIm) system with a visible range camera. 3D InIm provides depth information, mitigates the occlusion and fog in front of the object, and improves object recognition capabilities. For object recognition, the YOLOv3 neural network is used for each of the tested imaging systems. Since the concentration of fog affects the images from different sensors (visible, LWIR, and Azure Kinect depth cameras) in different ways, we compared the performance of the network on these images in terms of average precision and average miss rate. For the experiments we conducted, the results indicate that in degraded environments 3D InIm using visible range cameras can provide better image reconstruction than the LWIR camera and the Azure Kinect RGBD camera, and therefore it may improve the detection accuracy of the network. To the best of our knowledge, this is the first report comparing the object detection performance of a passive integral imaging system versus active (LiDAR) sensing in degraded environments such as fog and partial occlusion.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

There is substantial interest in object recognition for many applications using both passive and active sensors such as lidar. Autonomous vehicles with lidar-based sensors can detect and recognize traffic objects precisely in ideal weather conditions, but their performance may suffer in adverse environmental conditions such as fog, rain, and partial occlusion. Environmental degradation reduces the quality of the captured images and therefore reduces object detection performance [1–4]. Thermal imaging systems, which depend only on the temperature difference between objects and the environment, can be used in such degraded environments for object detection and classification. However, thermal imaging provides limited textural information about objects as a result of low spatial resolution sensors. As such, defining high quality object features becomes difficult, and object recognition approaches may fail to detect and classify objects of similar shape and size [5,6].

Three-dimensional (3D) integral imaging (InIm) using a visible camera is a prominent technique for object recognition tasks: it provides depth information and higher spatial resolution than IR imaging, and it uses depth segmentation to isolate the object of interest from the background scene and partial occlusions [7–10]. These characteristics of 3D InIm can be used to improve object detection and localization in degraded environments.

In integral imaging, multiple 2D perspectives of a 3D scene, known as elemental images, can be recorded using a camera array, a lenslet array, or a single camera on a moving platform. These elemental images can be used to reconstruct the 3D scene either computationally or optically. The reconstructed 3D image provides depth slicing of the scene, which can benefit automated object recognition in foggy and partially occluded environments in ways not possible with conventional 2D imaging [11–23].

Since the image quality of different sensors is affected differently in degraded environments, we train an object detection network for each passive or active imaging system separately. Our aim is to compare the object recognition performance of these active and passive sensors in degraded environments. Moreover, for visible and LWIR sensing, we consider both conventional 2D imaging and 3D InIm. For the Azure Kinect imaging system, we use an RGBD image, which is a fused image between the Azure Kinect RGB camera and its lidar depth sensor. We choose the YOLOv3 [24] network for this object recognition task, and the system performance is compared in terms of average precision and average miss rate. The dataset is recorded at different levels of fog (low, moderate, and dense) without occlusion to train the network, and the testing data is recorded in fog both with and without partial occlusion. For the experiments reported here, the results indicate that 3D InIm using a visible range camera may improve the performance of the object detection network in detecting and classifying objects in adverse environmental conditions including fog and partial occlusion.

2. Materials and methods

2.1 Integral imaging

In this work, synthetic aperture integral imaging (SAII) is used to capture the 3D information of the scene [23]. An image sensor on a 2D translational stage is used to capture the 2D elemental images from different perspectives of the 3D scene. A fog chamber, shown in Fig. 1(a), is placed in front of the camera to create foggy environments using a fog generator. The pickup process of InIm is shown in Fig. 1(a), wherein a single camera is attached to a two-axis translational stage. The 3D image of a scene at a particular depth is reconstructed by backpropagating the elemental images to that depth through a virtual pinhole array. The computational reconstruction process of the 3D image is shown in Fig. 1(b). The reconstructed image at each distance is a depth-sectioned image of the 3D scene. This reconstruction mitigates the effects of partial occlusion in front of the objects of interest and provides an improved signal-to-noise ratio and improved image reconstruction compared to conventional 2D imaging [7,9,22,23].


Fig. 1. Schematic diagram of 3D InIm using synthetic aperture InIm (SAII). (a) Pickup process of InIm (b) reconstruction of 3D images via virtual pinhole array.


The mathematical formulation of the computational 3D reconstructed image is given as [23],

$${R_z}(x,y,z) = \frac{1}{{O(x,y)}}\sum\limits_{m = 0}^{M - 1} {\sum\limits_{n = 0}^{N - 1} {\left[ {{I_{m,n}}\left( {x - \frac{{m \times {L_x} \times {p_x}}}{{{c_x} \times z/f}},y - \frac{{n \times {L_y} \times {p_y}}}{{{c_y} \times z/f}}} \right) + \varepsilon } \right]} } ,$$
where (x, y) is the pixel location and z is the reconstruction depth of the scene. (m, n) are the indices of the elemental images along the x and y directions, respectively, and Im,n is the corresponding elemental image. M and N are the total numbers of elemental images in the x and y directions. O(x, y) is the number of pixels overlapping at (x, y). f is the focal length of the lens attached to the camera, and (px, py) are the pitches between adjacent camera positions in the x and y directions, respectively. (Lx, Ly) and (cx, cy) are the total number of pixels and the size of the sensor, respectively. ε is the additive camera noise.
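As an illustration, the following Python sketch implements the back-projection of Eq. (1) under simplifying assumptions (a uniform overlap count and wrap-around shifts via np.roll); the function and variable names are ours for illustration, not taken from the original implementation.

```python
# Minimal sketch of the SAII computational reconstruction of Eq. (1).
# Assumes the elemental images are grayscale numpy arrays of equal size.
import numpy as np

def reconstruct_depth_slice(elemental, z, f, pitch, sensor_size, eps=0.0):
    """Back-project an M x N grid of elemental images to depth z (same units as f, pitch)."""
    M, N = len(elemental), len(elemental[0])
    Ly, Lx = elemental[0][0].shape              # sensor resolution in pixels (Ly rows, Lx cols)
    cx, cy = sensor_size                        # physical sensor size
    px, py = pitch                              # camera pitch in x and y
    accum = np.zeros((Ly, Lx), dtype=np.float64)
    overlap = np.zeros((Ly, Lx), dtype=np.float64)
    for m in range(M):
        for n in range(N):
            # Pixel shift of elemental image (m, n) for reconstruction depth z, as in Eq. (1)
            sx = int(round(m * Lx * px / (cx * z / f)))
            sy = int(round(n * Ly * py / (cy * z / f)))
            # np.roll is used for brevity; a full implementation pads the canvas
            # and tracks the valid overlap region instead of wrapping around.
            shifted = np.roll(np.roll(elemental[m][n] + eps, sy, axis=0), sx, axis=1)
            accum += shifted
            overlap += 1.0
    return accum / np.maximum(overlap, 1.0)
```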

2.2 YOLOv3 network

The object detection task is performed by YOLOv3 [24], which is an advanced version of YOLO [25]. YOLO is an object detector that uses a single convolutional neural network to simultaneously predict the bounding boxes of multiple objects in an image as well as the class probabilities for those boxes. It detects the location of objects in the image by creating a bounding box around each object and then classifies them. The original YOLO architecture consists of 24 convolutional layers followed by two fully connected layers. The network divides the input image into S × S grid cells. If the center of an object falls in a grid cell, that cell is responsible for detecting the object. Each grid cell predicts B bounding boxes, including the confidence score and position information of each bounding box. The probability that a box contains a target is represented by its confidence score (c). The confidence score of a bounding box indicates how confident the model is that the box contains an object and how accurate the predicted box is. The confidence score is defined as:

$${\textrm{Confidence score}} = pr(Object) \times IOU_{pred}^{truth}, $$
where pr(Object) is the probability that the box contains an object. The intersection over union (IOU) is used to measure the accuracy of the object detector; it quantifies the overlap between the ground truth and the predicted box. The value of IOU between the ground truth and the predicted box varies from 0 (no overlap) to 1 (complete overlap). Therefore, the confidence score encodes both the probability that an object is present and how well the predicted box fits the object. YOLOv3 introduces a new backbone architecture, Darknet-53, and predicts objects at three different scales. Because of this multiscale prediction, it performs better on smaller objects. Darknet-53 is a residual-network-type feature extractor that consists of 53 convolutional layers with shortcut connections between residual blocks, which reduces computation while extracting high-performance features. YOLOv3 replaces the softmax function with independent logistic classifiers to calculate the likelihood of the input belonging to a specific label.

The performance of the detectors is measured in terms of average precision and average miss rate. If the intersection over union (IOU) of a predicted bounding box exceeds the threshold, the prediction is considered correct. In this experiment, we set the threshold value to 0.5 for object detection. The precision, recall, and miss rate are defined in terms of true positives (TP), false positives (FP), and false negatives (FN) as:

$$\begin{aligned} &\textrm{Precision} = \frac{TP}{TP + FP},\\ &\textrm{Recall} = \frac{TP}{TP + FN},\\ &\textrm{Miss rate} = \frac{FN}{FN + TP}. \end{aligned}$$

A true positive occurs when the detector classifies the object correctly, a false positive when the detector classifies an object incorrectly, and a false negative when the model fails to detect an object present in the image. Finally, average precision, which is equal to the area under the precision-recall curve, is measured over all IOU threshold values from 0.5 to 1. We chose the YOLOv3 architecture for the object detection task in adverse environmental conditions because it performs better on small and medium objects in both the visible and LWIR ranges [26,27]. YOLOv3 is also significantly faster than other detectors for object detection and classification [27].
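For reference, a minimal Python sketch of the quantities in Eqs. (2)-(3) is given below; the box format and helper names are assumptions for illustration only, not part of the YOLOv3 implementation used here.

```python
# Illustrative helpers for the detection metrics: IoU between two axis-aligned
# boxes, and precision / recall / miss rate from TP, FP, FN counts.
# Box format [x1, y1, x2, y2] is an assumption for this sketch.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def detection_metrics(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    miss_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return precision, recall, miss_rate

# A prediction counts as a true positive when its IoU with the ground-truth
# box exceeds the threshold (0.5 in this work).
print(detection_metrics(tp=18, fp=3, fn=4))
```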

3. Experimental setup

A fog chamber is placed in front of the camera to create foggy environments, and a fog generating device fills the chamber with fog for the visible and LWIR range imaging systems. We used a thin plastic (polythene) material for the fog chamber so that LWIR radiation could propagate through it. The visibility of the scene is varied by controlling the concentration of fog inside the chamber. To quantify the concentration of fog, we used the Beer-Lambert law to calculate Beer’s coefficient (attenuation coefficient) α. The Beer-Lambert law is given as $I = {I_0}{e^{ - \alpha z}}$, where I0 is the intensity of a laser diode (wavelength 630 nm) without fog (I0 = 820 μW), and I is the intensity of the laser diode after propagating through a distance z = 470 mm in the fog medium. The reduction of scene visibility due to various fog concentrations is shown in Fig. 2, and the corresponding attenuation coefficients are tabulated in Table 1. Furthermore, the visibility of scenes in different levels of fog is estimated by calculating the scattering coefficient from the recorded image [28]. The Koschmieder atmospheric scattering model is used to describe the formation of haze and is given as [28]:

$$\begin{aligned} I(x) &= J(x)t(x) + A(1 - t(x)),\\ t(x) &= {e^{ - \mathrm{\beta} d(x)}}, \end{aligned}$$
where I(x) is the observed (hazy) image, J(x) is the haze-free image (ideal image), A is the atmospheric light, and t(x) is the medium transmittance, which measures the fraction of light that is not attenuated and reaches the sensor. d(x) is the distance between the objects and the observer, and β is the scattering coefficient of the atmosphere, which defines the thickness of the fog. In addition to Beer’s coefficient, we estimated the scattering coefficient of the experimental dataset to quantify the levels of fog. To quantify the thickness of fog in our experimental dataset, we divided the dataset into three categories based on the scattering coefficient: dense fog (β = 8 to 10), moderate fog (β = 4 to 8), and low fog (β = 1.5 to 4). The scattering coefficients of the scenes in Fig. 2 are 3.2, 7.8, and 9.4 for low, moderate, and dense fog, respectively, as shown in Table 1.
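For concreteness, the sketch below shows how the fog density could be quantified from the measurements described above: Beer’s coefficient from the attenuated laser power, and a fog category from the estimated scattering coefficient. The numeric values follow the text; the function names are illustrative and not part of the original processing code.

```python
# Sketch of the fog-quantification step, assuming the measured and
# fog-free laser powers are available.
import math

def beer_coefficient(I, I0=820e-6, z=0.470):
    """Attenuation coefficient alpha (1/m) from I = I0 * exp(-alpha * z)."""
    return -math.log(I / I0) / z

def fog_level(beta):
    """Map an estimated scattering coefficient beta to the categories used here."""
    if beta >= 8.0:
        return "dense"       # beta = 8 to 10
    elif beta >= 4.0:
        return "moderate"    # beta = 4 to 8
    else:
        return "low"         # beta = 1.5 to 4

print(beer_coefficient(I=410e-6))   # hypothetical example: power halved over 470 mm
print(fog_level(7.8))               # -> "moderate", as for Fig. 2(b)
```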


Fig. 2. Visible 2D images at different levels of fog. (a) Low (β = 3.2), (b) moderate (β = 7.8), and (c) dense fog (β = 9.4).



Table 1. The density of fog and its corresponding Beer’s coefficient and scattering coefficient.

A visible camera for visible imaging and an LWIR camera for thermal imaging are used to capture the elemental images. The visible camera has a pixel size of 6.5 μm × 6.5 μm with a sensor size of 2048 × 2048 pixels, and the LWIR camera has a pixel size of 17 μm × 17 μm with a sensor size of 320 × 240 pixels. The focal lengths of the visible and LWIR cameras are 50 mm and 11 mm, respectively. A total of 9 elemental images (3 horizontal × 3 vertical) were recorded with a pitch of 30 mm in the x and y directions for both the visible and LWIR cameras. Sample training images for the visible and LWIR cameras are shown in Fig. 3. The 2D image used in this experiment is the central perspective of the elemental images, as shown in Fig. 3(a) for the visible camera and Fig. 3(e) for the LWIR camera. The depth images of a 3D scene for both the visible and LWIR cameras are reconstructed using the SAII algorithm [Eq. (1)]. Figures 3(b-d) show the 3D images at different depths for the visible scene, and Fig. 3(f-h) shows the images at different depths for the thermal scene.


Fig. 3. Sample image for training the network in visible and LWIR range. (a) Central perspective of 2D elemental images, (b-d) 3D reconstructed images at different depths in visible range. Corresponding LWIR images for (e) 2D elemental image and, (f-h) reconstructed 3D images at different depths. The classes used in these experiments are CM (cold mannequin), HM (Hot mannequin), LK (Large Kettle), SK (Short Kettle), IRN (Iron), Wrench, and Glass (Glass beaker).


A sample image of the RGBD dataset for training the network is shown in Fig. 4. Use of the fog chamber causes ambiguity in lidar depth sensing because the chamber surface reflects back most of the near-IR rays. To mitigate this issue, the scenes for lidar imaging in fog were recorded in an open environment filled with fog. The color and depth information of the scene were recorded using the latest generation of Microsoft Kinect depth cameras, i.e., the Azure Kinect camera. In addition to the RGB channels (Fig. 4(a)), the depth channel also provides valuable information about the scene, as shown in Fig. 4(b). In this work, we utilized both the RGB and depth channels to improve the performance of the detector compared to using the RGB channels alone. The RGB and depth images are first aligned with each other and then fused together [concatenating the RGB and depth (D) channels] to obtain a four-channel RGBD image, as shown in Fig. 4(c). The YOLOv3 network is modified by replacing the three-channel input layer with a four-channel input layer to train on the RGBD images. In order to display the RGB and depth information in an aligned image, as shown in Fig. 4(c), we used two RGB channels and one depth channel; this is done only for the purpose of visualization.
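A minimal sketch of this fusion step, assuming the RGB and depth images are already aligned numpy arrays, is shown below; the normalization and function names are illustrative assumptions rather than the exact preprocessing used here.

```python
# Sketch of RGBD fusion: stack the (aligned) depth channel with the RGB
# channels to form a four-channel input for the modified YOLOv3 input layer.
import numpy as np

def fuse_rgbd(rgb, depth):
    """rgb: HxWx3 uint8; depth: HxW (e.g., millimeters). Returns HxWx4 float32 in [0, 1]."""
    rgb_n = rgb.astype(np.float32) / 255.0
    d = depth.astype(np.float32)
    d_n = (d - d.min()) / (d.max() - d.min() + 1e-6)     # normalize depth to [0, 1]
    return np.concatenate([rgb_n, d_n[..., None]], axis=-1)

def rgbd_preview(rgb, depth):
    """Visualization only (as in Fig. 4(c)): two RGB channels plus the depth channel."""
    d_n = (depth.astype(np.float32) - depth.min()) / (depth.max() - depth.min() + 1e-6)
    preview = rgb.astype(np.float32) / 255.0
    preview[..., 2] = d_n                                # replace one color channel with depth
    return preview
```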


Fig. 4. Sample image for training the network in RGBD with Kinect. (a) RGB image of the scene, (b) depth image (D) of the scene, and (c) fused RGBD image of the scene. The classes used in these experiments are CM (cold mannequin), HM (Hot mannequin), LK (Large Kettle), SK (Short Kettle), IRN (Iron), Wrench, and Glass (Glass beaker).


The training dataset for visible, LWIR, and RGBD images of the Azure Kinect camera was recorded in foggy environments. A total of 90 scenes (360 images) at different levels of fog were recorded for training the network. The training was done on the visible, LWIR, and RGBD datasets. The pretrained Darknet-53 was used as a feature extractor, where the initial 53 convolutional layers extract the features. The prediction layers of the network were trained from scratch using the training dataset. To handle monochromatic images, the network was trained by stacking the grayscale image into three identical channels. The YOLOv3 network was trained with the stochastic gradient descent with momentum (SGDM) optimizer [29], and hyperparameters such as batch size, number of epochs, and learning rate were tuned for each of the three sensors on a validation dataset distinct from the training dataset. We chose seven scenes (28 images) from each camera for hyperparameter tuning. The hyperparameters were tuned via grid search with an early stopping criterion; the candidate values of batch size, number of epochs, and learning rate were (8, 16, 32), (70, 80, 100, 150), and ($1 \times {10^{ - 4}},5 \times {10^{ - 4}},1 \times {10^{ - 3}},5 \times {10^{ - 3}}$), respectively. The best hyperparameters for each detector were selected (see Table 2) for the classification of objects. In our experiment, seven different classes were chosen for object detection. In order to compare the performance of the three sensors (visible, LWIR, and Azure Kinect), five different YOLOv3 detectors were trained using the training dataset to compare between: 2D and 3D InIm visible range imaging, 2D and 3D InIm LWIR range imaging, and RGBD imaging with the Azure Kinect camera.
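The following sketch illustrates such a grid search over the hyperparameter values listed above; `train_yolov3` and `validate_map` are hypothetical placeholders for the actual training and validation routines, which are not part of the original text.

```python
# Hedged sketch of the grid search with early stopping on a held-out validation set.
from itertools import product

batch_sizes = [8, 16, 32]
epoch_counts = [70, 80, 100, 150]
learning_rates = [1e-4, 5e-4, 1e-3, 5e-3]

def grid_search(train_yolov3, validate_map):
    """train_yolov3 and validate_map are placeholder callables (assumption)."""
    best = {"score": -1.0, "params": None}
    for bs, ep, lr in product(batch_sizes, epoch_counts, learning_rates):
        # SGDM training; early stopping is assumed to be handled inside train_yolov3
        detector = train_yolov3(batch_size=bs, max_epochs=ep, learning_rate=lr)
        score = validate_map(detector)      # e.g., average precision on the validation scenes
        if score > best["score"]:
            best = {"score": score, "params": (bs, ep, lr)}
    return best
```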


Table 2. Optimal hyperparameters of object detectors

4. Results and discussions

The testing dataset was recorded in the presence of different levels of fog, both with partial occlusion (30 images) and without partial occlusion (25 images) of the objects in the scene. In total, 55 images were recorded with each camera system for testing the performance of the five trained detectors. In order to diversify the testing dataset, we recorded scenes at different levels of fog with estimated scattering coefficients (β) ranging from 1.5 to 10. We used an equal number of high, moderate, and low fog-density images to test the trained detectors for all imaging modes. Sample central-perspective 2D images of partially occluded objects in moderately dense fog (β = 7.2) for the visible and LWIR cameras are shown in Fig. 5(a) and Fig. 5(d), respectively. The reconstructed 3D images at particular depths are shown in Fig. 5(b-c) for the visible camera and in Fig. 5(e-f) for the LWIR camera. The last row of Fig. 5 shows the RGBD image of partially occluded objects in fog obtained with the Kinect. The advantage of 3D InIm over conventional 2D imaging is that it allows depth slicing of the 3D scene and segments out the object of interest from the background, which helps the detectors locate the objects in a multiclass classification problem.


Fig. 5. Sample images of partially occluded objects in moderate fog (β = 7.2) for testing the detection network. (a) Visible 2D image, (b-c) reconstructed 3D image at depths z = 2.9 m and z = 3.9 m, respectively in fog and under partial occlusion. (d-f) Thermal (LWIR) images in fog and partial occlusion, (d) LWIR 2D image and (e-f) reconstructed 3D images at depths z = 2.9 m and z = 3.9 m, respectively. (g) RGB image of Azure Kinect camera, (h) depth image of Azure Kinect camera, and (i) RGBD image of Azure Kinect camera in degraded environments (fog and occlusion). The classes used in these experiments are CM (cold mannequin), HM (Hot mannequin), LK (Large Kettle), SK (Short Kettle), IRN (Iron), Wrench, and Glass (Glass beaker).


Examples of detection results from the YOLOv3 detectors for visible, LWIR, and Kinect RGBD scenes in the presence of fog and under partial occlusion are shown in Fig. 6. The first row shows the reference 2D images of the visible camera [Fig. 6(a)], LWIR camera [Fig. 6(b)], and Kinect camera [Fig. 6(c)]. Figure 6(d-l) shows the testing images of a partially occluded scene recorded in dense fog with a scattering coefficient β = 8.5. The second row shows detection results in the visible range [Fig. 6(d-f)]. The detector for visible range 3D InIm successfully detected the objects present in the scene, as shown in Fig. 6(e-f), and also distinguished between similar objects such as the small kettle (SK) and large kettle (LK). Due to partial occlusion, however, the 2D imaging detector could not detect all the classes present in the image shown in Fig. 6(d); objects behind the occlusion are identified with the visible range 3D InIm reconstructions. In the case of thermal and RGBD imaging, the LWIR 2D image, the LWIR 3D InIm reconstructions [Fig. 6(g-i)], and the Kinect RGBD images [Fig. 6(j-l)] fail to detect all the objects present in the scene.


Fig. 6. Object detection and classification results for various sensors in dense fog (β = 8.5) and partial occlusion. (a-c) Reference 2D images of the scene for the visible, LWIR, and Kinect cameras, respectively. Object detection results of the various sensors using the testing dataset are presented in (d)-(l). (d) Visible 2D imaging in fog fails to detect all the objects; (e-f) successful detection of all classes with 3D InIm in fog in the visible range. (g-i) LWIR detection results with (g) a thermal 2D image, which fails to detect the objects, and (h-i) thermal 3D images, which also fail to detect all the objects. (j-l) Detection results using the Kinect RGBD dataset, which fails to detect all objects. The classes used in these experiments are CM (cold mannequin), HM (Hot mannequin), LK (Large Kettle), SK (Short Kettle), IRN (Iron), Wrench and Glass (Glass beaker).


Performance results for 2D and 3D InIm with the visible and LWIR sensors, as well as for the Azure Kinect RGBD images, are summarized in the following tables. The LWIR InIm detector and the RGBD detector misclassify more classes than the visible range InIm detector. This is due to the limited contrast and texture information offered by LWIR thermal images and to the fact that the image content of the LWIR camera depicts only the distribution of temperature differences; therefore, the detector could not distinguish between objects of similar shape and size that may be cold. The depth image of the Azure Kinect camera starts to degrade as the concentration of fog increases, and therefore the visibility of the RGBD image becomes poor. For the experiments performed here in degraded environments, visible range 3D InIm provides improved textural information of objects compared to IR imaging and lidar, which helps the detector classify objects of similar shape and size in the scene. Therefore, the detector for visible range 3D InIm outperforms the detectors for the other sensors. Tables 3 and 4 show the quantitative values of the average precision and average miss rate for all detectors in the presence of different levels of fog (β = 1.5 to 10) and partial occlusion. It is noted that visible range 3D InIm of partially occluded objects in fog significantly improves the performance of the detector compared to 2D imaging in the visible range. However, the improvement in detection for LWIR 3D InIm compared to LWIR 2D imaging is not significant. The reason could be that, due to poor spatial resolution, the textural information of thermal imaging is not improved even after 3D reconstruction.


Table 3. Precision of each class and weighted average of precision for visible, LWIR, and Kinect RGBD sensors. a


Table 4. Miss rate of each class and weighted average of miss rate for visible, LWIR, and RGBD images.a

Finally, the performance of the detectors for all sensors (2D, 3D, and Kinect) is plotted in terms of log-average miss rate versus false positives per image by varying the threshold from 0.5 to 1. Results under the adverse conditions of fog with partial occlusion and of fog only are shown in Fig. 7(a) and Fig. 7(b), respectively. The lower the miss rate, the better the performance of the detector. The experimental curves show that the detector for visible range 3D InIm outperforms the other detectors in the detection and classification of objects in environments degraded by fog and partial occlusion.
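As an illustration of how such curves can be summarized, the sketch below computes the miss rate and false positives per image at each threshold setting and a log-average miss rate; the nine log-spaced reference points are a common convention and an assumption here, not necessarily the exact procedure used for Fig. 7.

```python
# Illustrative computation of the miss rate / false-positives-per-image (FPPI)
# curve and its log-average summary from per-threshold TP, FP, FN counts.
import numpy as np

def miss_rate_fppi(tp, fp, fn, num_images):
    """tp, fp, fn: arrays of counts, one entry per threshold setting."""
    tp, fp, fn = (np.asarray(a, dtype=float) for a in (tp, fp, fn))
    miss_rate = fn / np.maximum(fn + tp, 1.0)
    fppi = fp / float(num_images)
    return miss_rate, fppi

def log_average_miss_rate(miss_rate, fppi, refs=np.logspace(-2, 0, 9)):
    # Interpolate the miss rate at reference FPPI values and average in log space
    order = np.argsort(fppi)
    mr_at_refs = np.interp(refs, fppi[order], miss_rate[order])
    return np.exp(np.mean(np.log(np.maximum(mr_at_refs, 1e-10))))
```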


Fig. 7. Average miss rate versus false positives per image for all detectors. (a) In adverse conditions of fog and partial occlusions, (b) in the adverse condition of fog only. 3D is obtained with integral imaging. RGBD is obtained with Kinect camera. IR: LWIR.


Thus, from Tables 3 and 4, we can see that the weighted accuracy for visible range 3D InIm in all degraded environments outperforms the other 2D and 3D imaging systems. The weighted average precision is 78% for visible range 3D InIm, whereas it is 45% for thermal (LWIR) 3D InIm and 55% for Kinect RGBD imaging for partially occluded objects in fog. Similarly, in the case of fog only, the weighted average precision for visible range 3D InIm is 84%, whereas it is 51% for thermal (LWIR) 3D InIm and 70% for RGBD imaging. Table 5 summarizes the results of object detection and recognition for 3D imaging with the visible, LWIR, and Kinect cameras in all degraded conditions.


Table 5. Comparative results of weighted average precision for depth imaging of visible range, thermal range, and RGBD imaginga

The performance of a neural network for object detection and classification depends on the spatial features, contrast, and sharpness of the imaging dataset. Thermal imaging cameras capture heat energy (thermal photons), which has longer wavelengths than visible light, to create images. This means each LWIR detector element (pixel) has to be correspondingly larger than those of visible light detectors in order to absorb the longer wavelength [30]. As a result, a thermal camera usually has a lower spatial resolution than a visible light image sensor. Moreover, when the scene and the objects in it are at a relatively homogeneous temperature, objects of the same shape and size become poorly distinguishable. Hence, the poor resolution of the LWIR camera and the low contrast of thermal imaging limit the performance of the neural network for detection and classification. The Azure Kinect camera uses an active light source for depth sensing, and the depth information of the scene depends on the amount of light reflected back to the Kinect sensor from objects present in the scene. Therefore, depth imaging at long range or in scattering environments such as fog or snow becomes challenging because only a small fraction of the reflected photons return to the sensor. Hence, in the presence of fog, the Azure Kinect camera provides depth information only for close-range objects, which limits the performance of the neural network for object detection and classification. To overcome these limitations of the depth imaging system, we used depth imaging by employing 3D InIm in the visible range, which preserved the textural information and provided better contrast and sharpness of the image. Unlike lidar, InIm provides depth information for both close-range and long-range objects, which improves the performance of the neural network.

5. Conclusion

In summary, we have investigated both passive and active (lidar) sensing modalities for object detection and recognition in the presence of different levels of fog (β = 1.5 to 10), with and without partial occlusion. We have performed experiments to compare the performance of object detection and recognition across 2D and 3D visible range sensors, 2D and 3D thermal (LWIR) imaging, and depth-fused RGBD images from the Kinect camera in degraded environments. We have used the YOLOv3 neural network for object detection. Passive 3D imaging was performed with a 3D InIm system. For the experiments performed in this study, we found that due to the lack of adequate textural information and the poor spatial resolution of thermal imaging systems, automated LWIR detection systems may fail to classify objects of the same shape and size in the scene. In the case of the lidar depth imaging system (Azure Kinect camera), the concentration of fog degraded the visibility of the RGBD image, and the detector performed poorly in classifying objects in the degraded scene. However, the visible range 3D InIm system preserved the textural information and spatial resolution of the image, which helped to improve the detection accuracy of the automated detection system in the degraded conditions investigated in this paper. The quantitative results show that visible range 3D InIm may significantly improve the accuracy of the detector in the classification of objects in degraded conditions. Future work includes investigating various InIm systems and algorithms [31–33] to remove the impact of fog on object detection. While we have shown laboratory results, integral imaging can operate at very long range as long as the elemental images are captured with appropriate imagers [34].

Funding

Office of Naval Research (N000142012690, N000142212349, N000142212375); Air Force Office of Scientific Research (FA9550-21-1-0333).

Acknowledgements

B. Javidi acknowledges support by Air Force Office of Scientific Research (FA9550-21-1-0333) and Office of Naval Research (N000142012690, N000142212375, N00014-22-1-2349). T. O’Connor acknowledges the Department of Education through the GAANN Fellowship. We thank Hamamatsu Photonics K. K. for the C11440-42U camera.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Hasirlioglu, A. Kamann, I. Doric, and T. Brandmeier, “Test methodology for rain influence on automotive surround sensors,” IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), 2242–2247 (2016).

2. A. Pfeuffer and K. Dietmayer, “Optimal Sensor Data Fusion Architecture for Object Detection in Adverse Weather Conditions,” 2018 21st International Conference on Information Fusion (FUSION), 2588–2595 (2018).

3. X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network for autonomous driving,” Proceedings of the IEEE CVPR, 6526–6534, (2017).

4. T. Rothmeier and W. Huber, “Performance Evaluation of Object Detection Algorithms Under Adverse Weather Conditions,” Intelligent Transport Systems, From Research and Development to the Market Uptake 364, 211–222 (2021). [CrossRef]  

5. S. Komatsu, A. Markman, A. Mahalanobis, K. Chen, and B. Javidi, “Three-dimensional integral imaging and object detection using long-wave infrared imaging,” Appl. Opt. 56(9), D120–D126 (2017). [CrossRef]  

6. P. Wani, K. Usmani, G. Krishnan, T. O’Connor, and B. Javidi, “Lowlight object recognition by deep learning with passive three-dimensional integral imaging in visible and long wave infrared wavelengths,” Opt. Express 30(2), 1205–1218 (2022). [CrossRef]  

7. A. Markman, X. Shen, and B. Javidi, “Three-dimensional object visualization and detection in low light illumination using integral imaging,” Opt. Lett. 42(16), 3068–3071 (2017). [CrossRef]  

8. S. H. Hong and B. Javidi, “Three-dimensional visualization of partially occluded objects using integral imaging,” J. Disp. Technol. 1(2), 354–359 (2005). [CrossRef]  

9. B. Javidi, R. Ponce-Diaz, and S. H. Hong, “Three-dimensional recognition of occluded objects by using computational integral imaging,” Opt. Lett. 31(8), 1106–1108 (2006). [CrossRef]  

10. R. Schulein, C. M. Do, and B. Javidi, “Distortion-tolerant 3D recognition of underwater objects using neural networks,” J. Opt. Soc. Am. A 27(3), 461–468 (2010). [CrossRef]  

11. G. Lippmann, “Epreuves reversibles donnant la sensation du relief,” J. Phys. 7(1), 821–825 (1908).

12. A. P. Sokolov, Autostereoscopy and Integral Photography by Professor Lippmann’s Method (Izd-vo MGU, Moscow State University, Moskva, 1911).

13. H. E. Ives, “Optical properties of a Lippmann lenticulated sheet,” J. Opt. Soc. Am. 21(3), 171–176 (1931). [CrossRef]  

14. N. Davies, M. McCormick, and L. Yang, “Three-dimensional imaging systems: a new development,” Appl. Opt. 27(21), 4520–4528 (1988). [CrossRef]  

15. C. Burckhardt, “Optimum parameters and resolution limitation of integral photography,” J. Opt. Soc. Am. 58(1), 71–76 (1968). [CrossRef]  

16. Y. Igarashi, H. Murata, and M. Ueda, “3-D display system using a computer generated integral photograph,” Jpn. J. Appl. Phys. 17(9), 1683–1684 (1978). [CrossRef]  

17. J. Y. Son, W. H. Son, S. K. Kim, K. H. Lee, and B. Javidi, “Three-dimensional imaging for creating real-world-like environments,” Proc. IEEE 101(1), 190–205 (2013). [CrossRef]  

18. H. Arimoto and B. Javidi, “Integral three-dimensional imaging with digital reconstruction,” Opt. Lett. 26(3), 157–159 (2001). [CrossRef]  

19. F. Okano, H. Hoshino, J. Arai, and I. Yuyama, “Real-time pickup method for a three-dimensional image based on integral photography,” Appl. Opt. 36(7), 1598–1603 (1997). [CrossRef]  

20. M. Martinez-Corral, A. Dorado, J. C. Barreiro, G. Saavedra, and B. Javidi, “Recent advances in the capture and display of macroscopic and microscopic 3D scenes by integral imaging,” Proc. IEEE 105(5), 825–836 (2017). [CrossRef]  

21. A. Stern and B. Javidi, “Three-dimensional image sensing and reconstruction with time-division multiplexed computational integral imaging,” Appl. Opt. 42(35), 7036–7042 (2003). [CrossRef]  

22. J. S. Jang and B. Javidi, “Three-dimensional synthetic aperture integral imaging,” Opt. Lett. 27(13), 1144–1146 (2002). [CrossRef]  

23. Seung-Hyun Hong, Ju-Seog Jang, and Bahram Javidi, “Three-dimensional volumetric object reconstruction using computational integral imaging,” Opt. Express 12(3), 483–491 (2004). [CrossRef]  

24. J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv:1804.02767 (2018). [CrossRef]

25. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 779–788 (2016).

26. Y. He and Z. Liu, “A Feature Fusion Method to Improve the Driving Obstacle Detection Under Foggy Weather,” IEEE Trans. Transp. Electrific. 7(4), 2505–2515 (2021). [CrossRef]  

27. M. Kristo, M. Ivasic-Kos, and M. Pobar, “Thermal Object Detection in Difficult Weather Conditions Using YOLO,” IEEE Access 8, 125459–125476 (2020). [CrossRef]  

28. K.M. He, J. Sun, and X.O. Tang, “Single image haze removal using dark channel prior,” IEEE Conference on Computer Vision and Pattern Recognition, 1956–1963 (2009).

29. N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Netw. 12(1), 145–151 (1999). [CrossRef]

30. “Beyond resolution, sensitivity looms large for infrared thermal imaging cameras,” https://spie.org/news/photonics-focus/julyaug-2022/improving-infrared-thermal-sensing-cameras?SSO=1

31. B. Javidi, F. Pla, J. M. Sotoca, X. Shen, P. Latorre-Carmona, M. Martínez-Corral, R. Fernández-Beltrán, and G. Krishnan, “Fundamentals of automated human gesture recognition using 3D integral imaging: a tutorial,” Adv. Opt. Photonics 12(4), 1237–1299 (2020). [CrossRef]  

32. M. Martinez-Corral and B. Javidi, “Fundamentals of 3D imaging and displays: A tutorial on integral imaging, Lightfield, and plenoptic systems,” Adv. Opt. Photonics 10(3), 512–566 (2018). [CrossRef]  

33. B. Javidi, A. Carnicer, J. Arai, T. Fujii, H. Hua, H. Liao, M. Martínez-corral, F. Pla, A. Stern, L. Waller, Q. H. Wang, G. Wetzstein, M. Yamaguchi, and H. Yamamoto, “Roadmap on 3D integral imaging: sensing, processing, and display,” Opt. Express 28(22), 32266–32293 (2020). [CrossRef]  

34. D. LeMaster, B. Karch, and B. Javidi, “Mid-Wave Infrared 3D Integral Imaging at Long Range,” J. Disp. Technol. 9(7), 545–551 (2013). [CrossRef]  
