Underwater object detection and temporal signal detection in turbid water using 3D-integral imaging and deep learning

Rakesh Joshi; Kashif Usmani; Gokul Krishnan; Fletcher Blackmon; Bahram Javidi

doi:10.1364/OE.510681

1. Introduction

Underwater object recognition and signal detection have become a challenging problem in degraded environments of turbidity and occlusion due to severe scattering and attenuation of light [1–5]. To counter the aforementioned challenges, various approaches have been proposed for signal detection in a turbid medium, such as a correlation filter-based detector [1,2,4], polarimetric-based approaches [2,6], and single-pixel detectors [7], among others. Deep learning-based methods [3] have recently gained attraction for classification and detection tasks due to their learning capabilities and higher accuracy [8–10].

In recent years, deep neural networks have been proposed to detect the signal in atmospheric turbulence and scattering medium [11–13]. It has been shown that they achieve better detection accuracy than conventional methods. Deep learning-based approaches have also been proposed for object detection tasks that can detect underwater objects, but their performance may suffer in adverse conditions such as turbidity and partial occlusions. Under such a scenario, 3D InIm-based approaches prove to be useful [3,10]. 3D InIm strategies can capture the 3D information about the scene, including the intensity and angular information, which is not possible using approaches that only capture the intensity information (2D imaging). The 3D integral imaging can segment the depth to isolate the object of interest from the background scene and partial occlusion. Thus, 3D integral imaging has become a prominent technique for object recognition tasks under degradation [14]. Previously, a deep learning-based approach with multidimensional integral imaging for detecting temporally encoded optical signals detection in the turbid medium has been proposed [3]. In such a scenario, 3D integral imaging outperformed conventional 2D imaging for signal detection capabilities. This paper presents a hybrid system that enables us to detect objects and temporal signals from an underwater scene. The proposed method utilizes the You Only Look Once (YOLO) deep convolutional neural network with a recurrent neural network to extend the analysis into the spatial-temporal domain. A CMOS image sensor array captures the underwater scene, and the recorded 2D elemental video frames are reconstructed using 3D integral imaging. In this work, we used YOLO v4 to detect underwater objects, which incorporates techniques such as skip connection, and the system performance is compared in terms of average precision and average miss rate [15].

In the proposed integrated object classification and temporal optical signal detection using integral imaging, temporally encoded optical signals are transmitted through turbid and partially occluded underwater environments. A CMOS image sensor array captures this signal, and the recorded 2D elemental video frames are reconstructed using 3D integral imaging to improve the temporal optical signal detection accuracy under degraded conditions. The reconstructed object depth is estimated by using the minimum variance depth estimation approach [16]. The processed video sequence is fed into the CNN-BiLSTM-based detector to detect the encoded source light signal. To perform the signal classification, the CNN-BiLSTM-based network is trained without occlusion in turbid water ranges α from 0.0025 to 0.0101 mm^-1. Here α (Beer’s coefficient) is used to quantify the turbidity levels. The performance of the proposed system is measured using performance metrics such as Matthew’s correlation coefficient (MCC), receiver operating characteristic (ROC) curves, the area under the curve (AUC), and the number of bit errors. By using the proposed approach, it's possible to create comprehensive underwater surveillance systems. These systems can detect and identify underwater objects such as submarines or unauthorized divers while also enabling communication between surveillance devices for data transmission, enhancing maritime security. Further, it enhances the capabilities of underwater robotics, enabling navigation, data collection, and object identification. Thus, it has potential applications in various domains, including security, environmental monitoring, and robotics.

The paper is organized as follows: a brief review of integral imaging, the approach used for object detection, and temporal signal detection methods are described in Section 2, followed by the details of the optical imaging system and underwater data collection procedure as described in Section 3. The experimental results and discussions are given in Section 4, and finally, the conclusions are presented in Section 5.

2. Methodology

2.1 3D integral imaging

Integral imaging (InIm) is a 3D imaging technology that comprises the camera pickup and computational reconstruction process, which was initially suggested by Lippmann [17]. Multiple 2D elemental pictures are recorded during the pickup or capture operation, and optical rays’ intensity and directional information are recorded [18–21]. As illustrated in Fig. 1, these elemental images can be captured using a lenslet array, an array of cameras, or a moving camera. A pinhole model is assumed in the reconstruction process. The rays are back-propagated through their corresponding virtual pinholes at a certain depth plane to provide depth information about the 3D scene [19–21]. The 3D reconstruction process can be stated mathematically as follows:

(1)$${I^{rec}}({x,y;z;t} )= \frac{1}{{O({x,y;z;t} )}}\mathop \sum \nolimits_{m = 0}^{K - 1} \mathop \sum \nolimits_{n = 0}^{L - 1} E{I_{m,n}}\left( {x - m\frac{{{N_x}{P_x}f}}{{{C_x}z}},n\frac{{{N_y}{P_y}f}}{{{C_y}z}};t} \right)$$

where I^rec (x, y; z; t) is the integral imaging reconstructed image at depth z and at time frame t, x and y are the pixel indices of each elemental image. O(x, y; z; t) is the number of overlapping pixels at time frame t. z is the reconstruction distance denoted as z = z_air +z_w/n_w, where z_air is the distance in air medium, z_w is the distance in the water, and n_w is the refractive index of water. The total number of pixels in each elemental image is represented by N_x, N_y, the number of elemental images acquired in the in x and y directions is represented by K and L, and the elemental image in the m^th column and n^th row is represented by EI_m,n (·). The reconstructed image is created on a certain depth plane by moving and overlaying the elemental images. The pitch between neighboring image sensors on the camera array is P_x and P_y, respectively, while the focal length of the camera lens is f. The width and height of the image sensor, respectively, are C_x and C_y. Figure 1(a) and 1(b) show the camera pickup and computational reconstruction processes, respectively.

Fig. 1. 3D integral imaging process: (a) Pickup (image capture) stage, and (b) computational volumetric reconstruction process for an integral imaging system.

Download Full Size | PDF

2.2 Object detection and temporal signal detection method

The proposed approach combines the YOLO deep convolutional neural network with the CNN-BiLSTM to extend the analysis into the spatial-temporal domain. YOLO is an object recognition technique which predicts the bounding boxes for objects in an image in a single-stage fashion. The network divides the input images into an S ${\times} $ S grid; if the center of an object falls into the grid cell, that grid cell is responsible for the detection. The primary loss function used in YOLO is based on the sum of the square errors between the predicted and ground truth bounding boxes and the confidence score and class probabilities [15]. In this work, we used YOLO v4, which incorporates techniques such as skip connection, feature pyramid network, and anchor boxes to improve the algorithm's accuracy and speed. YOLO v4 comprises three parts: backbone, neck, and head. The backbone network uses CSPDarkNet53, which acts as a feature extraction network while reducing network redundancy and enhancing feature representation. The neck network consists of Spatial pyramid pooling (SPP) and Path aggregation network (PAN). The SSP concatenates the max-pooling outputs of the low-resolution feature map to extract the most representative features, enabling the model to handle objects of different sizes effectively. The concatenated feature maps from the SPP module are fused with the high-resolution feature maps by using a PAN. The PAN uses upsampling and downsampling operations to set bottom-up and top-down paths for combining low-level and high-level features. It outputs a set of aggregated feature maps to use for predictions. The head takes features from the neck module and predicts the bounding boxes, objectiveness score, and classification score [15]. The YOLOv4 network uses a detection head similar to that of YOLOv3. These typically include multiple convolutional layers, followed by a combination of different-sized anchor boxes, and predict bounding boxes and class probabilities for each anchor box. The YOLO v4 head generates predictions by assigning class probabilities and bounding box coordinates to the anchor boxes. Once the predictions are obtained, a non-maximum suppression (NMS) process is applied to filter out low-confidence detections and refine the bounding boxes. The final output of the YOLO v4 head is a set of bounding boxes, along with their associated class labels and confidence scores, representing the detected objects in the input image. In this work, we used YOLO v4 to detect underwater objects. The complete network structure is shown in Fig. 2.

Fig. 2. The network structure of YOLOv4 for object classification. The input is the underwater scene with turbidity and occlusion. The output is the classified object plus its range.

Download Full Size | PDF

For spatial-temporal optical signal detection, a video sequence is fed through a recurrent neural network to detect the spatial-temporal optical signal. Here, a CNN-BiLSTM-based deep neural network is used. A pre-trained CNN is used for extracting the feature vectors from each frame of the captured video sequence. A pre-trained model can transfer learned feature vectors to a newer problem using several training sets. A pretrained GoogLeNet [22] trained on the ImageNet dataset [23] extracts the spatial features. The extracted video sequence features are fed into a bi-directional long short-term memory (BiLSTM) network, shown in Fig. 3. The BiLSTM network is a variant of a long short-term memory (LSTM) network. It consists of two separate LSTM networks that are used to extract the temporal information in the forward and backward directions [24–27]. Consider a t-step input sequence that is sent through a BiLSTM network [26]. The update at the t-th hidden vector in a BiLSTM network is updated by both forward and backward directions. Equation (2) depicts the forward LSTM flow. Similarly, the reverse equation is derived by replacing → with ←.

(2)$$\left. \begin{array}{l} \overrightarrow {{i_t}} = \sigma ({\overrightarrow {{W_{xi}}} {x_t} + \overrightarrow {{W_{hi}}} {h_{t - 1}} + \overrightarrow {{b_i}} } )\\ \overrightarrow {{f_t}} = \sigma ({\overrightarrow {{W_{xf}}} {x_t} + \overrightarrow {{W_{hf}}} {h_{t - 1}} + \overrightarrow {{b_f}} } )\\ \overrightarrow {{o_t}} = \sigma ({\overrightarrow {{W_{xo}}} {x_t} + \overrightarrow {{W_{ho}}} {h_{t - 1}} + \overrightarrow {{b_o}} } )\\ \overrightarrow {{g_t}} = \textrm{tanh}({\overrightarrow {{W_{xc}}} {x_t} + \overrightarrow {{W_{hc}}} {h_{t - 1}} + \overrightarrow {{b_c}} } )\\ \overrightarrow {{c_t}} = \overrightarrow {{f_t}} \odot \overrightarrow {{c_{t - 1}}} + \overrightarrow {{i_t}} \odot \overrightarrow {{g_t}} \\ \overrightarrow {{h_t}} = \overrightarrow {{o_t}} \odot \textrm{tanh}({\overrightarrow {{c_t}} } )\end{array} \right\}$$

where → and ← represent the forward and backward directions, respectively. i_t, f_t and o_t represent the input gate, forget gate, and output gates, respectively. To rescale the signal to [0, 1], these three gates utilize a sigmoid function, $\sigma (x )= \frac{1}{{1 + {e^{ - x}}}}$. The modulated gate and hidden state at the t-th time are denoted by g_t and h_t respectively. The modulated gate uses a hyperbolic tangent function $tanh(x )= \frac{{{e^x} - {e^{ - x}}}}{{{e^x} + {e^{ - x}}}}$ rescale the signal to (-1, 1). Here, W_kl and b_k, $[k \in \{ k,h\}$ and $[l \in \{ o,c,f,i\}$ represent the associated weight matrices and bias terms of the network. In this case, ${\odot} $ denotes element-wise multiplication. The output of these forward and backward layers is combined, resulting in ${h_t} = f(\overleftarrow {{h_t}} ,\overrightarrow {{h_t})} $, where f is the concatenation operator.

Fig. 3. The network structure of CNN-Bi-LSTM based optical signal detection. Temporally encoded signals in turbid water are detected as described in Section 2

Download Full Size | PDF

The information encoded in an M-frame sequence can be used to produce a k dimensional feature vector ${\textrm{x}_i}$, for each frame ${I_n}$, n = 1,2,3,…M. Thus, for each M frame sequence, we get the feature matrix X by concatenating the feature vectors ${\textrm{x}_i}$. The feature matrix X for each video is then fed into a Bi-LSTM network, and its output is fed to a fully connected layer followed by a softmax layer for classification. In the training phase, reconstructed 7- frame video sequences captured in clear water without occlusion are used to train an end-to-end CNN-BiLSTM network.

3. Experimental methods

We propose a multidimensional integral imaging-based experimental setup for underwater data collection in a turbid and occluded environment as shown in Fig. 4. The underwater scene consists of underwater objects and a light source of 630 nm wavelength inside a turbid water tank with dimensions of 500(W) × 250(L) × 250(H) mm. The light source is used to transmit optical signals through turbid water. The signal is transmitted through a turbid underwater medium with occlusion at a speed of 20 bits per second using a light-emitting diode (LED). Placing an artificial plant in the water tank creates a partially occluded underwater environment. The signal transmission and receiving cameras in our studies are synchronized to the point that each recorded frame corresponds to the transmission of exactly 1 bit of the coded transmitted signal. This synchronization was done in order to transmit signals at the maximum speed allowable by the recording camera's framerate. Recent developments in camera technology (e.g., FastCam SA-X2; Photron) allow much faster frame rates of 1 million frames per second [28]. The turbidity of the water is regulated by adding an antacid, and experiments were conducted at various turbidity levels to evaluate the proposed system's effectiveness at various turbidities. We employ Beer's coefficient to measure turbidity, derived from Beer-Lambert's law, which states that I = I₀e^-αd, where I₀ is initial intensity, I is the intensity after propagating a distance d in a turbid medium, and α is Beer’s coefficient. A sample of water from each turbidity condition was obtained for testing, and the intensity of light I₀ and I was measured at two locations with d = 76 mm, where d is the distance between the I₀ and I. The optical power meter and detector (Newport 818-SL/DB Silicon) are used to measure the intensities I₀ and I at a wavelength of 630 nm with a 10 mm aperture. A 3 × 3 camera array comprising G-192 GigE cameras and C-mount zoom lenses with a focal length of 20 mm was used to record the underwater scene. The camera array is synchronized to capture video data at a frame rate of 20 frames per second. The distance between cameras is 80 mm on horizontal and vertical axes. The camera sensor has a spatial resolution of 1600(H) 1200(V) and a pixel size of 4.5 µm × 4.5 µm.

Fig. 4. Experimental setup for an integrated dual-function underwater object detection and temporal optical signal detection using integral imaging.

Download Full Size | PDF

For underwater object detection, the sample image of the training data set is shown in Fig. 5. The 3D underwater scene is obtained and reconstructed using the integral imaging technique. The 2D image in the scene is the central perspective of the elemental images, as shown in Fig. 5(a) and 5(b), and the corresponding 3D scene of the underwater objects at different depths (Z = 1.8 m and Z = 2.1 m) reconstructed using the integral imaging algorithm is shown in Fig. 5(a(i)-a(ii)) and 5(b(i)-b(ii)). Training datasets for object detection are recorded without occlusion in turbid water with α ranging from 0.0025 to 0.0101 mm^-1. A total of 300 scenes with different backgrounds and different turbidity levels were recorded for training the network. The YOLO v4 network was trained with Adam optimizers. The hyperparameters such as batch size, number of epochs, and learning rates were tuned on a validation dataset different from the training dataset. The hyperparameters are tuned via grid search, and the range of values of batch size, number of epochs, and learning rate are (2, 4, 8, 16), [20, 40, 60, 100], and [1 × 10−4,5 × 10−4,1 × 10−3,5 × 10−3], respectively using early stop criteria. The two different YOLO v4 detectors were trained using a training data set to compare: 2D and 3D InIm. The best hyperparameters for 2D detectors are 4 (batch size), 20 (epochs) and 1 × 10−3 (learning rate), for 3D detectors are 16 (batch size), 40 (epochs), and 1 × 10−4 (learning rate). In this experiment, two different classes were chosen for object detection.

Fig. 5. Sample images for training the object classification network for 2D and 3D InIm. (a) Sample 2D elemental images in clear water (α = 0.0025 mm⁻¹); (a(i)-a(ii)) 3D reconstructed images at different depths (Z = 1.8 m and Z = 2.1 m) in clear water. (b) Sample 2D elemental images in turbidity at α = 0.0101 mm−1; (b(i)-b(ii)) 3D reconstructed images at different depths (Z = 1.8 m and Z = 2.1 m) in turbidity at α = 0.0101 mm⁻¹. InIm: Integral imaging; Z: reconstructed depth.

Download Full Size | PDF

For optical signal detection, the gold code sequence was used to encode the optical signals [29]. The 7-bit gold code is sent when the transmitted data is 1, and the 7-bit flipped gold code is sent when the transmitted data is 0. As a result, the Bi-LSTM network could use the temporal structural difference for classification. We use an 8-bit data sequence of [1, 0, 0, 1, 1, 0, 1, 0], with each bit of the original data encoded with a 7-bit gold code ([1,1,0,0,1,0,1]), generating a 56-bit encoded signal. In this experiment, α ranges from 0.0027 to 0.0391 mm^-1. After receiving the transmitted video sequence, the sliding window technique is used to obtain a series of 7 frame video sequences, U_i (i, i + 1,…, i + 6), from the transmitted signal, i = 1,2,3,…N-6 and N is the total number of frames in the encoded, transmitted video sequence [3]. For a 56 frame video (8-bit binary data coded with 7-bit gold code), there are 36 distinct possible 7-frame video sequences, and can be grouped into three different cases: case 1) where the 7-frame video sequence exactly overlaps with the gold code sequence (class ‘1’), case 2) where the 7-frame video sequence exactly overlaps with the flipped gold code sequence (class ‘0’), and case 3) the remaining 34 possibilities where the sliced 7-frame video sequence does not exactly match the gold code nor the flipped gold code sequences (‘idle’ class). As a result, three distinct classes are constructed and labeled as ‘1’, ‘0’, and ‘idle’ to train CNN-BiLSTM-based neural networks. To train the network, the transmitted signals are recorded in turbid water without occlusion, α ranges from 0.0025 to 0.0101 mm^-1. We collected data using F-1.8 (F-number) and performed 3D reconstruction at the estimated depth. Signal depth is estimated using a minimum variance technique [16]. For the training data set, we transmitted 32 symbols of the signal, each symbol coded with a 7-bit gold code encoding scheme. While applying the 7-frame sliding window approach to the recorded data, we have 16 videos corresponding to class ‘1’, 16 videos corresponding to class ‘0’, and 34 unique video sequences corresponding to class ‘idle’. The training data set is imbalanced since the number of videos in the ‘idle’ classes is much larger than in the ‘1’ and ‘0’ classes. To handle imbalanced data, we use a random oversampling approach, which takes copies or samples with replacements from the minority classes (classes ‘1’ and ‘0’) until the minority classes have the same number of samples as the majority classes in the training data. Thus, we made 34 videos for each class. So, for each turbidity level, we have 102 (34 × 3) videos; for three different turbidity levels, we have 306 (102 × 3) videos. The test data was collected in the turbid and partially occluded environment using the same experimental setup shown in Fig. 4. Six different turbidity levels were recorded, ranging from α= 0.0027 to 0.0391 mm^-1. In our experimental conditions, test data was recorded under turbid water and partial occlusion. However, the training data doesn’t consist of partial occlusion. This has been done since, in real-world scenarios, occlusions are random and completely unknown. Employing integral imaging techniques notably mitigates the effects of partial occlusion, aligning the distribution of test data more closely with the patterns observed in the training dataset. Consequently, this enhancement enables neural networks to exhibit improved performance, even when presented with unknown instances of partial occlusion.

4. Experimental results and discussion

To test the performance of the YOLOv4 detector, we recorded data with occlusion and turbidity, as shown in Fig. 6. The first column shows the 2D image of two different scenes. Figure 6(a) shows the testing images of partially occluded underwater objects recorded at α= 0.0027. Figure 6(a(i)-a(ii)) represents the 3D reconstructed image of underwater objects at different depths (Z = 1.9 m and Z = 2.15 m). Figure 6(b) shows the test images of underwater objects along with the partially occluded underwater optical signal light source at α= 0.0325. It can be noted that the detector for 3D InIm has successfully detected the objects present in the scene, as shown in Fig. 6(b(i)-b(ii)).

Fig. 6. Sample image for test data for object classification. (a) Sample 2D elemental images at α = 0.0027 mm⁻¹; (a(i)-a(ii)) Classified 3D reconstructed images at different depths (Z = 1.8 m and Z = 2.1 m) at α = 0.0027 mm⁻¹. (b) Sample 2D elemental images at α = 0.0325 mm⁻¹; (b(i)-b(ii)) Classified 3D reconstructed images at different depths (Z = 1.8 m and Z = 2.1 m) at α = 0.0325 mm⁻¹. Z: reconstructed depth.

Download Full Size | PDF

Table 1 shows the quantitative results while using the average precision and average miss rates for all detectors in the presence of different levels of turbidity and partial occlusions. It is noted that 3D InIm of partially occluded objects improves the performance of the detector as compared to 2D imaging.

Table 1. Precision and miss rate of each class at different turbidity levels.^a

View Table | View all tables in this article

To test the spatial-temporal optical signal detection performance, we recorded data with occlusion and turbidity; a sample 2D elemental image is recorded using a central camera with an aperture size of F1.8 as shown in Fig. 7(a). The integral imaging-based reconstructed image at depths Z = 2.3 m is illustrated in Fig. 7(b) at α = 0.0087 mm⁻¹. Figure 7(c) shows 2D elemental image and Fig. 7(d) integral imaging-based reconstructed image at depths Z = 2.3 m and turbidity α= 0.0325 mm⁻¹.

Fig. 7. Sample test data in signal detection. (a) Sample 2D elemental images at α = 0.0087 mm⁻¹; (b) 3D reconstructed optical signal at depths Z = 2.3 m and turbidity α = 0.0087 mm⁻¹. (c) Sample 2D elemental images at α = 0.0325 mm⁻¹; (d) 3D reconstructed optical signal at depths Z = 2.3 m and turbidity α = 0.0325 mm⁻¹. Z: reconstructed depth.

Download Full Size | PDF

In the underwater optical signal detection experiment, the sliding window with the CNN-BiLSTM classifier is used to detect the binary transmitted signal [3]. The classification scores obtained from the CNN-BiLSTM-based classification for each U_i (i, i + 1,…,i + 6) video are transformed into a modified classification score sequence S(i). To build S(i), we pick the maximum value between the classification scores of ‘0’ and ‘1’; if the selected value corresponds to class ‘0,’ it is multiplied by -1; otherwise, if it corresponds to class ‘1,’ it remains intact [3]. The transmitted video sequence's transformed classification scores S(i) should contain high and low peaks that correspond to the transmission of binary sequences 1 and 0, respectively. Given the 8-bit original data and 7-bit coding procedure, the updated classification score S(i) is predicted to have 8 prominent spots of either local minima or maxima values, each separated by 7 bits. The first video sequence identified as class ‘1’ or class ‘0’ by CNN-BiLSTM detector is the first starting peak of signal transmission. The modified classification score S(i) can be used to perform the final classification of the transmitted binary data sequences. We set the threshold to 0 for our results. The encoded 56-bit signal transmission experiment is performed eight times for each turbidity level, equivalent to the 64-bits of original data to be decoded to compute the performance metrics. For evaluating the performance of the detector with different modalities, we have computed Mathew's correlation coefficient (MCC) as a performance metric [30]. It accounts for true and false positives and negatives and is widely recognized as a balanced metric that can be applied even when the classes are of varying sizes or unbalanced. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 is an inverse prediction. These MCC values at various turbidity levels are reported in Table 1. The MCC can be described in terms of a confusion matrix C for K classes in the multiclass scenario as

(3)$$MCC = \frac{{c \times s - \mathop \sum \nolimits_k^K {p_k} \times {t_k}}}{{\sqrt {\left( {{s^2} - \mathop \sum \nolimits_k^K p_k^2} \right) \times \left( {{s^2} - \mathop \sum \nolimits_k^K t_k^2} \right)} }}$$

where ${t_k} = \mathop \sum \nolimits_i^K {C_{ik}}$ is the number of times class k truly occurred, ${p_k} = \mathop \sum \nolimits_i^K {C_{ki}}$ is the number of times class k was predicted, $c = \mathop \sum \nolimits_k^K {C_{kk}}$ is the total number of samples correctly predicted, and $s = \mathop \sum \nolimits_k^K \mathop \sum \nolimits_j^K {C_{ij}}$ is the total number of samples.

From Table 2, we can see that the proposed 3D InIm achieves a significant performance improvement as compared to 2D conventional imaging system due to the presence of partial occlusion. The MCC performance metric is used to measure the quality of CNN- BiLSTM classifications network. Further, the performance of the signal detection systems after applying sliding window method with CNN-BiLSTM-based classification approach for binary signal detection are evaluated using the receiver operating characteristic (ROC) curves, the area under the curve (AUC), and the number of detection errors. Figure 8 shows the ROC curve for underwater signal detection for turbidity (α = 0.039 mm⁻¹) level. The ROC for the 3D integral imaging system (blue line) method achieves an AUC of 0.930, which outperforms the other tested methods. The conventional 2D imaging (black line) gives AUC values of 0.422, at turbidity level of α = 0.039 mm⁻¹.

Fig. 8. The ROC (receiver operating characteristic) curve in underwater signal detection for Beer’s coefficient α=0.39 mm^-1. The performance of the systems is compared between the 3D InIm (blue line) and conventional 2D imaging (black line). InIm: Integral imaging

Download Full Size | PDF

Table 2. Mathew's correlation coefficient (MCC) at various turbidity levels for 2D imaging and 3D integral imaging.^a

View Table | View all tables in this article

Figure 9 shows the area under the ROC curve (AUC) and the number of detection errors as a function of Beer's coefficient. The area under the curve (AUC) is shown against Beer's law coefficient in Fig. 9(a). The area under the curve (AUC) using the proposed approach is higher than the AUC of all other tested methods. Also, as shown in Fig. 9(b), the proposed method's number of detection errors remains lower than the number of detection errors of all other techniques tested.

Fig. 9. (a) Number of detection errors, and (b) area under curves (AUC) for underwater signal detection at various turbidity levels. Results are compared between 3D InIm (blue line) and conventional 2D imaging (Black line). InIm: Integral imaging.

Download Full Size | PDF

The experimental comparison in Fig. 8 and Fig. 9 reveals that the proposed 3D InIm-based detection method may outperform the 2D conventional CNN-BiLSTM-based approach in challenging experimental conditions such as occlusion and turbidity.

3D integral imaging has shown superiority over 2D imaging systems in object detection and classification due to its ability to reconstruct depth information of the scenes, which benefits in isolating the object of interest from the background and partial occlusions. Integral imaging demands higher computational and storage resources due to the involvement of multiple perspectives and their reconstruction to extract three-dimensional information. However, these do not play a significant role in many safety-critical systems, as their foremost priority is overall accuracy and precision. The advancements in computational capabilities, compression techniques, and storage systems make 3D integral imaging more viable for various applications where speed and storage considerations are critical. Thus, at this point, we do not focus on these aspects. In the future, we may perform a thorough analysis of system requirements.

5. Conclusion

In conclusion, we have presented an integrated dual-function object detection and temporal signal detection system for degraded environments such as turbidity and partial occlusion. We compared the proposed method's performance to a traditional 2D imaging system, and our experimental results indicate that the 3D integral-based proposed method may substantially improve the performance of object detection and temporal signal detection systems compared to other imaging modalities under degraded environments such as partial occlusion and turbidity. At this point, we haven’t focused on the computational complexity aspect; however, the computational complexity highly depends on the 3D integral imaging configuration. This complexity is influenced by parameters such as the resolution of elemental images. Higher resolutions demand more computational resources for processing and reconstruction and the number of perspectives captured. There's often a trade-off between higher quality (higher depth range, higher resolutions) and increased computational demands. Thus, the configuration choice is based on the application requirement of the proposed approach. In the future, we may explore the minimum number of received array elements to achieve better computation complexity and compare the reduced number of cameras with the 3D performance achieved with a 3 × 3 array. Furthermore, we may find the optimal configuration in the context of the placement of the object, light source, transmit signal source, and camera array.

Similarly, we can also address the problem of color distortion since colors are absorbed differently at various depths. Using the integral imaging technique, we can become aware of how colors are affected at different depths, which calculates the color-restored factors based on depth and minimizes distortion [31]. Also, future experiments might include detection in more complex environments containing more objects and challenging conditions such as underwater turbulence [32], low light intensity, multipath fading channels, and exploring other integral imaging architectures [33] and algorithms [34].

Funding

Office of Naval Research (N000142212349, N000142212375); Air Force Office of Scientific Research (FA9550-21-1-0333).

Acknowledgments

We wish to acknowledge support under The Office of Naval Research (ONR) (N000142212375, N000142212349), Air Force Office of Scientific Research (FA9550-21-1-0333). Gokul Krishnan acknowledges support via the General Electric (GE) graduate fellowship for excellence.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. R. Joshi, T. O’Connor, X. Shen, et al., “Optical 4D signal detection in turbid water by multidimensional integral imaging using spatially distributed and temporally encoded multiple light sources,” Opt. Express 28(7), 10477–10490 (2020). [CrossRef]

2. R. Joshi, G. Krishnan, T. O’Connor, et al., “Signal detection in turbid water using temporally encoded polarimetric integral imaging,” Opt. Express 28(24), 36033–36045 (2020). [CrossRef]

3. G. Krishnan, R. Joshi, T. O’Connor, et al., “Optical signal detection in turbid water using multidimensional integral imaging with deep learning,” Opt. Express 29(22), 35691–35701 (2021). [CrossRef]

4. S. Komatsu, A. Markman, and B. Javidi, “Optical sensing and detection in turbid water using multidimensional integral imaging,” Opt. Lett. 43(14), 3261–3264 (2018). [CrossRef]

5. B. Javidi, A. Carnicer, J. Arai, et al., “Roadmap on 3D integral imaging: sensing, processing, and display,” Opt. Express 28(22), 32266–32293 (2020). [CrossRef]

6. M. Dubreuil, P. Delrot, I. Leonard, et al., “Exploring underwater target detection by imaging polarimetry and correlation techniques,” Appl. Opt. 52(5), 997–1005 (2013). [CrossRef]

7. E. Tajahuerce, V. Durán, P. Clemente, et al., “Image transmission through dynamic scattering media by single-pixel photodetection,” Opt. Express 22(14), 16945–16955 (2014). [CrossRef]

8. N. Cohen, S. Shmilovich, Y. Oiknine, et al., “Deep neural network classification in the compressively sensed spectral image domain,” J. Electron. Imag. 30(04), 1–10 (2021). [CrossRef]

9. H. Lee, I. Lee, T. Q. S. Quek, et al., “Binary signaling design for visible light communication: a deep learning framework,” Opt. Express 26(14), 18131–18142 (2018). [CrossRef]

10. G. Krishnan, R. Joshi, T. O’Connor, et al., “Human gesture recognition under degraded environments using 3D-integral imaging and deep learning,” Opt. Express 28(13), 19711–19725 (2020). [CrossRef]

11. H. Bakır and K. Elmabruk, “Deep learning-based approach for detection of turbulence-induced distortions in free-space optical communication links,” Phys. Scr. 98(6), 065521 (2023). [CrossRef]

12. M. A. Amirabadi, M. H. Kahaei, and S. A. Nezamalhosseini, “Deep learning based detection technique for FSO communication systems,” Phys. Commun. 43, 101229 (2020). [CrossRef]

13. S. Avramov-Zamurovic, A. T. Watnik, J. R. Lindle, et al., “Machine learning-aided classification of beams carrying orbital angular momentum propagated in highly turbid water,” J. Opt. Soc. Am. A 37(10), 1662–1672 (2020). [CrossRef]

14. K. Usmani, T. O’Connor, P. Wani, et al., “3D object detection through fog and occlusion: passive integral imaging vs active (LiDAR) sensing,” Opt. Express 31(1), 479–491 (2023). [CrossRef]

15. J. Redmon, S. Divvala, R. Girshick, et al., “You Only Look Once: Unified, Real-Time Object Detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 779–788.

16. M. DaneshPanah and B. Javidi, “Profilometry and optical slicing by passive three-dimensional imaging,” Opt. Lett. 34(7), 1105–1107 (2009). [CrossRef]

17. G. Lippmann, “Épreuves réversibles donnant la sensation du relief,” J. Phys. Theor. Appl. 7(1), 821–825 (1908). [CrossRef]

18. X. Xiao, B. Javidi, M. Martinez-Corral, et al., “Advances in three-dimensional integral imaging: Sensing, display, and applications [Invited],” Appl. Opt. 52(4), 546–560 (2013). [CrossRef]

19. S.-H. Hong, J.-S. Jang, and B. Javidi, “Three-dimensional volumetric object reconstruction using computational integral imaging,” Opt. Express 12(3), 483–491 (2004). [CrossRef]

20. B. Javidi, R. Ponce-Díaz, and S.-H. Hong, “Three-dimensional recognition of occluded objects by using computational integral imaging,” Opt. Lett. 31(8), 1106–1108 (2006). [CrossRef]

21. M. Martínez-Corral and B. Javidi, “Fundamentals of 3D imaging and displays: a tutorial on integral imaging, light-field, and plenoptic systems,” Adv. Opt. Photonics 10(3), 512–566 (2018). [CrossRef]

22. C. Szegedy, W. Liu, Y. Jia, et al., “Going deeper with convolutions,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9 (2015).

23. J. Deng, W. Dong, R. Socher, et al., “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–255 (2009).

24. S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput. 9(8), 1735–1780 (1997). [CrossRef]

25. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]

26. M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Process. 45(11), 2673–2681 (1997). [CrossRef]

27. J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, et al., “Beyond short snippets: Deep networks for video classification,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4694–4702 (2015).

28. “FASTCAM SA-X2,” https://photron.com/fastcam-sa-x2/.

29. R. Gold, “Optimal binary sequences for spread spectrum multiplexing (Corresp.),” IEEE Trans. Inf. Theory 13(4), 619–621 (1967). [CrossRef]

30. G. Jurman, S. Riccadonna, and C. Furlanello, “A Comparison of MCC and CEN Error Measures in Multi-Class Prediction,” PLoS ONE 7(8), e41882 (2012). [CrossRef]

31. J. Zhou, D. Zhang, W. Ren, et al., “Auto color correction of underwater images utilizing depth information,” IEEE Geoscience and Remote Sensing Letters 19, 1–5 (2022). [CrossRef]

32. Z. Vali, A. Gholami, Z. Ghassemlooy, et al., “Experimental study of the turbulence effect on underwater optical wireless communications,” Appl. Opt. 57(28), 8314–8319 (2018). [CrossRef]

33. M. Li and H. Li, “Application of deep neural network and deep reinforcement learning in wireless communication,” PLoS One 15(7), e0235447 (2020). [CrossRef]

34. J. M. Haut, R. Fernandez-Beltran, M. E. Paoletti, et al., “A new deep generative network for unsupervised remote sensing single-image super-resolution,” IEEE Trans. Geosci. Remote Sensing 56(11), 6792–6810 (2018). [CrossRef]

Beer’s Coefficient (α)		Shark		Submarine
Beer’s Coefficient (α)		Precision	Miss rate	Precision	Miss rate
0.0027	2D	0.847	0.156	0.739	0.260
0.0027	3D InIm	0.970	0.050	0.932	0.080
0.0087	2D	0.680	0.318	0.600	0.470
0.0087	3D InIm	0.880	0.110	0.850	0.130
0.0117	2D	0.587	0.217	0.500	0.500
0.0117	3D InIm	0.820	0.160	0.799	0.205
0.0251	2D	0.467	0.558	0.350	0.688
0.0251	3D InIm	0.624	0.386	0.595	0.404
0.0325	2D	0.293	0.709	0.227	0.772
0.0325	3D InIm	0.426	0.570	0.411	0.617
0.0391	2D	0.061	0.918	0.053	0.943
0.0391	3D InIm	0.168	0.843	0.106	0.901

Beer’s Coefficient (α)	Mathew's correlation coefficient (MCC)
Beer’s Coefficient (α)	3D Integral Imaging	2D Conventional Imaging
0.0027	1	.10
0.0087	1	0.08
0.0117	1	0.07
0.0251	1	0.05
0.0325	1	0
0.0391	0.9	0

Beer’s Coefficient (α)		Shark		Submarine
Beer’s Coefficient (α)		Precision	Miss rate	Precision	Miss rate
0.0027	2D	0.847	0.156	0.739	0.260
0.0027	3D InIm	0.970	0.050	0.932	0.080
0.0087	2D	0.680	0.318	0.600	0.470
0.0087	3D InIm	0.880	0.110	0.850	0.130
0.0117	2D	0.587	0.217	0.500	0.500
0.0117	3D InIm	0.820	0.160	0.799	0.205
0.0251	2D	0.467	0.558	0.350	0.688
0.0251	3D InIm	0.624	0.386	0.595	0.404
0.0325	2D	0.293	0.709	0.227	0.772
0.0325	3D InIm	0.426	0.570	0.411	0.617
0.0391	2D	0.061	0.918	0.053	0.943
0.0391	3D InIm	0.168	0.843	0.106	0.901

Beer’s Coefficient (α)	Mathew's correlation coefficient (MCC)
Beer’s Coefficient (α)	3D Integral Imaging	2D Conventional Imaging
0.0027	1	.10
0.0087	1	0.08
0.0117	1	0.07
0.0251	1	0.05
0.0325	1	0
0.0391	0.9	0

Underwater object detection and temporal signal detection in turbid water using 3D-integral imaging and deep learning

Abstract

1. Introduction

2. Methodology

2.1 3D integral imaging

2.2 Object detection and temporal signal detection method

3. Experimental methods

4. Experimental results and discussion

5. Conclusion

Funding

Acknowledgments

Disclosures

Data availability

References

Data availability

Cited By

Figures (9)

Tables (2)

Equations (3)

Optics Express