
End-to-end integrated pipeline for underwater optical signal detection using 1D integral imaging capture with a convolutional neural network

Open Access

Abstract

Underwater optical signal detection performance suffers from occlusion and turbidity in degraded environments. To tackle these challenges, three-dimensional (3D) integral imaging (InIm) with 4D correlation-based and deep-learning-based signal detection approaches have been proposed previously. Integral imaging is a 3D technique that utilizes multiple cameras to capture multiple perspectives of the scene and uses dedicated algorithms to reconstruct 3D images. However, these systems may have high computational requirements, multiple separate preprocessing steps, and the need for 3D image reconstruction and depth estimation of the illuminating modulated light source. In this paper, we propose an end-to-end integrated signal detection pipeline that uses the principle of one-dimensional (1D) InIm to capture angular and intensity information of rays but without the computational burden of full 3D reconstruction and depth estimation of the light source. The system is implemented with a 1D camera array instead of a 2D camera array and is trained with a convolutional neural network (CNN). The proposed approach addresses many of the aforementioned shortcomings to improve underwater optical signal detection speed and performance. In our experiment, temporally encoded signals transmitted by a light-emitting diode pass through a turbid and partially occluded environment and are captured by a 1D camera array. Captured video frames containing the spatiotemporal information of the optical signals are then fed into the CNN for signal detection without the need for depth estimation and 3D scene reconstruction. Thus, all processing steps are integrated and optimized by deep learning. We compare the proposed approach with the previously reported depth-estimated 3D InIm with 3D scene reconstruction and deep learning in terms of computational cost at the receiver's end and detection performance. Moreover, a comparison with conventional 2D imaging is also included. The experimental results show that the proposed approach performs well in terms of detection performance and computational cost. To the best of our knowledge, this is the first report on signal detection in degraded environments using a computationally efficient end-to-end integrated 1D InIm capture stage with integrated deep learning for classification.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Underwater optical signal detection has received much interest among researchers due to growing aquatic activities such as unmanned underwater vehicles, seafloor exploration, pollution monitoring, and oceanography research [1–8]. Traditionally, many underwater optical signal detection systems achieving good performance are tested in an underwater environment created by a water tank or swimming pool filled with tap water or clear water [4,7,9,10]. However, it is vital to consider phenomena such as scattering, absorption, and turbulence during optical signal transmission, which can degrade the received optical signals and result in poor detection performance. Moreover, possible occlusions in the communication channel can cause the entire optical signal detection system to fail. Therefore, to build a reliable optical signal detection system, researchers have emphasized studying systems in more challenging environments such as ocean water or turbid harbor water [11,12]. In such challenging environments, many reported systems may not be capable of achieving the desired bit error rate without sacrificing communication range [11,12].

To achieve a robust underwater optical signal detection system, researchers have developed novel imaging-based approaches for optical signal detection, such as the polarimetric-based approach [13] and peplography [14,15]. Compared to conventional optical signal detection, imaging-based methods can capture not only the intensity of the optical signal but also its spatial and angular features. Previously, researchers have proposed image-based optical wireless communication for automotive applications [16,17]. Nowadays, with high-speed cameras that can achieve up to 1,000,000 fps [18], it is possible to build fast, high-performance communication systems. Among imaging-based approaches, 3D integral imaging (3D InIm) shows great potential in turbid and occluding underwater environments. Integral imaging is a 3D technique that utilizes multiple cameras to capture multiple perspectives of a scene and uses dedicated algorithms to reconstruct 3D images. It has applications in remote sensing, night vision, and human-machine interaction, and it has great advantages in occlusion removal and noise reduction in various degraded environments [19–22]. Promising signal detection performance in turbid environments has been reported using 4D correlation-based multidimensional InIm [23] and multidimensional InIm with deep learning [24]. In both approaches, using 3D InIm plus temporal encoding rather than conventional temporal sensing improves detection performance in turbid and partially occluded environments.

There are challenges with implementing InIm-based approaches, such as computational time due to the requirement of 3D scene reconstruction, depth estimation of the temporally modulated light source, and accurate camera calibration for the camera array setup. In conventional InIm, accurate camera calibration is critical to the quality of 3D scene reconstruction. Specifically, parameters of the camera systems, such as focal length, image sensor dimensions, and camera resolution, must be known in order to perform camera calibration. Parameters such as the image translation and transformation between different cameras in the array are computed; these parameters account for the misalignment between cameras and are incorporated into the 3D scene reconstruction. To increase the speed of the overall processing pipeline and reduce hardware complexity while maintaining detection performance, we propose an end-to-end integrated signal detection pipeline that uses the principle of 3D InIm to capture angular and intensity information of rays using a 1D array of cameras but without the computational burden of full 3D scene reconstruction and depth estimation of the temporally modulated light source. The captured spatial and temporal data are processed using a dedicated convolutional neural network (CNN) for signal detection. We denote this approach as 1DInImCNN, as it is implemented with a 1D-camera-array-based InIm image capture stage with integrated CNN processing for signal detection. As discussed previously, unlike conventional InIm, this approach does not perform 3D scene reconstruction. Because this end-to-end integrated underwater optical signal detection pipeline is trained with a CNN, it can learn temporal and spatial relationships from the different perspectives of the 1D camera array; thus, there is no requirement for accurate camera calibration.

In our experiments, an optical source transmits temporally encoded optical signals using Gold codes. A 1D camera array captures the intensity and angular information of the optical signals, and the recorded videos are inputs to the CNN for signal detection. The CNN comprises a series of layers, including 3D convolutional layers, 3D max-pooling layers, and fully connected layers, and the network can learn spatiotemporal and multi-directional features from the captured images. The output of the CNN provides the transmitted symbol information. In this paper, we demonstrate that the proposed 1DInImCNN may outperform the previously proposed 3D-InIm-based deep learning approach [24], which uses full 3D scene reconstruction, in both computational cost and detection performance. Here, we denote the previously proposed InIm-based deep learning approach that performs 3D reconstruction of the scene as 3D InIm-based CNN-BiLSTM, in order to be consistent with the notation used in the previously reported paper [24]. For the sake of comparison with the proposed 1DInImCNN, the data capture is done using a 1D array of cameras, and the data are then processed by the 3D InIm CNN-BiLSTM, which includes 3D scene reconstruction.

The computational cost of the various approaches, namely the proposed integrated 1DInImCNN and conventional 3D InIm with a 1D camera array, 3D scene reconstruction, and a CNN with a bi-directional long short-term memory network (CNN-BiLSTM), is measured using floating point operations (FLOPs). Also, the detection performance of both systems is measured by the Matthews correlation coefficient (MCC). In summary, the proposed 1DInImCNN carries advantages in that 1) it is an end-to-end integrated pipeline that can be optimized using deep learning for signal detection, 2) it does not require any 3D reconstruction or depth estimation of the light source, and 3) camera calibration is not required.

The rest of the paper is organized as follows: Section 2 covers a brief review of the 3D InIm approach and the architecture of the 1DInImCNN. The experimental method is included in Section 3. Results and discussions are presented in Section 4. Finally, Section 5 concludes this paper.

2. Methodology

The objective of 1DInImCNN is to achieve optical signal detection with good classification performance and lower computational cost at the receiver's end compared to the previously proposed 3D InIm with CNN approach in scattering and partially occluded underwater environments. In the experiments, a light-emitting diode (LED) operating at a wavelength of 640 nm is used to transmit the optical signal. We utilize single-channel pulse width modulation (PWM) as the modulation type and a 7-bit Gold code as the encoding scheme. More details regarding the encoding scheme are included in Section 3. A 1D camera array is used as the receiver in the optical detection system. The frequency of the transmitter is synchronized with the frame rate of the cameras. We have used three different frequencies: 20, 60, and 120 Hz. We compare two imaging-based approaches for signal detection, our proposed 1DInImCNN and the previously proposed 3D InIm with CNN, in terms of classification performance and computational cost at the receiver's end. We mainly study multi-camera imaging-based approaches for underwater optical signal detection because of their improved performance in scattering and partially occluded underwater environments compared to conventional photodiode-based detection [23,24]. A description of 3D InIm with CNN and the proposed 1DInImCNN is included in the following subsections.

2.1 3D InIm-based approach

Integral imaging is a passive 3D imaging technique proposed by Lippmann in 1908 [25]. In conventional 3D InIm, a 2D micro-lens array or 2D camera array is utilized to capture multiple elemental images, which contain multi-directional information about a scene [22]. Then each elemental image is back-projected through a virtual pinhole array according to ray optics to reconstruct the 3D scene. A slice of the 3D scene can be reconstructed at any desired depth. In our approach, we use a 1D variation of InIm. Figure 1 shows the camera pick-up stage and the 3D InIm reconstruction stage using a 1D camera array. Mathematically, the 3D scene reconstruction is described as:

$$I(x,y;z;t) = \frac{1}{O(x,y;z;t)}\sum_{n=0}^{N-1}\sum_{m=0}^{M-1} EI_{n,m}\!\left(x - m\frac{N_x P_x f}{C_x z},\; y - n\frac{N_y P_y f}{C_y z};\, t\right) \tag{1}$$

In Eq. (1), $N_x$ and $N_y$ are the number of pixels in the x and y directions, respectively, f is the focal length of the camera lens, and z is the reconstruction depth. M and N are the total number of elemental images in the x and y directions, respectively. $P_x$ and $P_y$ are the pitches between adjacent image sensors, and $C_x$ and $C_y$ are the physical sizes of the image sensors in the x and y directions. $O(x,y;z;t)$ is a matrix storing the number of overlapping pixels at each time frame t.
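As a simplified illustration of Eq. (1), the sketch below performs the shift-and-average reconstruction for a horizontal 1D camera array in Python with NumPy. It assumes grayscale elemental images, integer-pixel shifts, and a single reconstruction depth, and it is not the authors' implementation.

```python
import numpy as np

def reconstruct_1d_inim(elemental, z_mm, pitch_mm, focal_mm, sensor_w_mm):
    """Shift-and-average 1D InIm reconstruction in the spirit of Eq. (1).
    `elemental` holds M grayscale elemental images with shape (M, H, W).
    Each elemental image m is shifted by m*Nx*Px*f/(Cx*z) pixels and the
    overlapping pixels are averaged, as counted by the overlap matrix O."""
    M, H, W = elemental.shape
    Nx = W                                        # pixels in the x direction
    shift = focal_mm * pitch_mm * Nx / (sensor_w_mm * z_mm)
    recon = np.zeros((H, W))
    overlap = np.zeros((H, W))                    # O(x, y; z) overlap counts
    for m in range(M):
        dx = int(round(m * shift))
        if dx >= W:
            continue
        recon[:, dx:] += elemental[m][:, :W - dx]
        overlap[:, dx:] += 1
    return recon / np.maximum(overlap, 1)
```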


Fig. 1. 3D InIm implemented by a 1D camera array: (a) Pick-up stage of InIm. (b) 3D InIm reconstruction for 1D InIm. InIm: Integral imaging


One of the previously reported methods uses a CNN with bi-directional long short-term memory (BiLSTM) to detect the optical signals in turbid water from the 3D reconstructed videos [24]. The CNN-BiLSTM network is composed of a pre-trained convolutional neural network for spatial feature extraction and bi-directional long short-term memory layers for temporal feature extraction. In the previous approach, the state-of-the-art GoogLeNet, pre-trained on popular image datasets such as ImageNet [26], is used for spatial feature extraction. After feature extraction, the spatial features are inputs to the BiLSTM to learn temporal information. A BiLSTM layer has two LSTM layers that learn in both the forward and backward directions. Finally, a fully connected layer, a SoftMax layer, and a classification layer located at the output of the CNN-BiLSTM network are used to classify the different signals.
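For context, a minimal sketch of such a CNN-BiLSTM classifier is given below in PyTorch, assuming a 7-frame video as input; the hidden size and the use of the last time step for classification are our assumptions, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNBiLSTM(nn.Module):
    """Sketch of a CNN-BiLSTM signal classifier: a pre-trained GoogLeNet
    extracts per-frame spatial features, and a bidirectional LSTM models
    the temporal structure of the 7-frame video."""
    def __init__(self, num_classes=3, hidden=128):
        super().__init__()
        backbone = models.googlenet(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()              # keep the 1024-D pooled features
        self.backbone = backbone
        self.bilstm = nn.LSTM(input_size=1024, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, video):                    # video: [B, T=7, 3, H, W]
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))   # [B*T, 1024]
        out, _ = self.bilstm(feats.view(b, t, -1))   # [B, T, 2*hidden]
        return self.classifier(out[:, -1])           # classify from last step
```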

2.2 1D InIm with convolutional neural network (1DInImCNN)

The proposed 1DInImCNN is a deep neural network that uses a 1D array of videos as inputs. A 1D array of cameras captures the angular and intensity information of the transmitted optical signals. Unlike conventional 3D InIm, the proposed approach does not require 3D reconstruction or depth estimation of the scene or the light source for signal detection. Figure 2 and Appendix A present the architecture and details of the proposed approach. The architecture of the proposed 1DInImCNN consists of input layers, a depth concatenation layer, and a series of 3D convolution, pooling, and batch normalization layers. Firstly, the dimension of each input layer is $[x, y, d, c]$, where x and y are the video's pixel resolution (the number of pixels in each elemental image), d is equal to 7, which is the length of the signal encoding scheme utilized, and c is the number of color channels. c is equal to 3 because the image sensors of our cameras are standard RGB color sensors. More details about the encoding scheme are given in Section 3. The approach used to aggregate videos from the different perspectives of a 1D array of cameras is essential for the performance of the 1DInImCNN. There are many ways to combine multi-perspective information in a neural network. During the design phase of the network architecture, we experimented with different approaches, such as the element-wise maximum operation, addition of all elemental images (all camera images), concatenation in the 4th dimension (color dimension), and concatenation in the temporal domain (depth concatenation). Our experiments revealed that temporal domain concatenation yielded better performance than the other approaches. Temporal domain concatenation is achieved by concatenating the input videos in the 3rd (temporal) dimension. For example, temporal concatenation of N input videos, each of dimension $[x, y, d, c]$, results in an output matrix of dimension $[x, y, N \ast d, c]$, as illustrated in the sketch after this paragraph. Simple tests on the location of the temporal concatenation layer in the 1DInImCNN reveal that placing the concatenation layer right after the input layers (that is, early in the network) is most beneficial for our data type. One advantage of early temporal concatenation is that it reduces the number of parameters in the neural network, which speeds up both training and testing. Also, in our tests, early temporal concatenation with sufficient training resulted in improved classification.
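The shape bookkeeping of the early temporal (depth) concatenation can be verified with a short NumPy check; the shapes below correspond to the smallest configuration described later, and the array contents are random placeholders.

```python
import numpy as np

# Three elemental videos of 7 frames each, 240 x 240 pixels, RGB.
x, y, d, c, N = 240, 240, 7, 3, 3
videos = [np.random.rand(x, y, d, c).astype(np.float32) for _ in range(N)]

# Temporal (depth) concatenation along the 3rd dimension:
# N videos of shape [x, y, d, c] -> one tensor of shape [x, y, N*d, c].
stacked = np.concatenate(videos, axis=2)
assert stacked.shape == (x, y, N * d, c)      # (240, 240, 21, 3)
```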


Fig. 2. 1DInImCNN architecture


The remaining architecture of the 1DInImCNN starts with a 3D convolutional layer using 32 kernels with a kernel size of 7×7×7 and a stride of [5,5,7]. The first 3D convolutional layer is designed to extract spatiotemporal features from each input, in accordance with the 7-bit encoding scheme we used. All the 3D convolutional layers are followed by a batch normalization layer [27] and a ReLU activation layer. Then a 3D max-pooling layer, with a kernel size of 4×4×4 and a stride of [2,2,2], follows to reduce the vector size. Two more 3D convolutional layers follow to learn high-level features between inputs from different perspectives. Finally, a 3D convolutional layer with 64 kernels of size 1×1×1 and a stride of [1,1,1] is used to reduce the number of parameters and to learn fine details of the features. A fully connected layer of 3 neurons, a SoftMax layer, and a classification layer then follow to classify the three signal classes. The Adam optimizer is used for training the 1DInImCNN [28].

As an end-to-end pipeline, the multi-perspective videos of the transmitted signal are directly fed into the 1DInImCNN, which outputs class conditional probabilities for each of the classes considered: ‘class 0’ corresponding to bit ‘0’, ‘class 1’ corresponding to bit ‘1’, and ‘class idle’ corresponding to the idle state of the system. As shown in Fig. 2, the number of input signals is not restricted to three. The designed networks are capable of learning features from training data and classifying testing data with anywhere from one input up to nine inputs or more. The network with one input (a single elemental image) is called the single-camera 1D integral imaging convolutional neural network, and the depth concatenation layer is not included in its architecture. In the experiments, many combinations of input dimensions were tested. We used a minimum resolution of 240(H) × 240(V) pixels for data collection since it is the smallest resolution at which the light source appears in the field of view of all cameras. A resolution of 1600(H) × 1200(V) is the maximum our cameras can support. Therefore, because of these experimental constraints, the designed networks were tested from a resolution of 240(H) × 240(V) to 1600(H) × 1200(V). Furthermore, we tested various network depths of the 1DInImCNN to find the optimal number of layers. We found that the 1DInImCNN, with the architecture in Appendix A, may be the smallest network that is still capable of learning features and of successful classification in signal detection.
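The overall flow of the network can be summarized with the hedged PyTorch sketch below. It reproduces the early temporal concatenation, the first 7×7×7 convolution with temporal stride 7, and the final 1×1×1 convolution and 3-class output; the kernel and stride choices of the intermediate layers are assumptions made so that the sketch runs for a 3-input, 240×240 configuration, and the exact layer sizes of Appendix A are not reproduced.

```python
import torch
import torch.nn as nn

class OneDInImCNN(nn.Module):
    """Sketch of a 1DInImCNN-style network. Each of the N elemental videos is
    shaped [batch, 3, 7, H, W]; they are concatenated early along the temporal
    dimension before the 3D convolutions."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            # 32 kernels of 7x7x7; temporal stride 7 so each 7-frame code word
            # of a concatenated video is summarized in one step (spatial stride 5).
            nn.Conv3d(3, 32, kernel_size=7, stride=(7, 5, 5)),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            # Spatial max pooling (temporal extent kept at 1 in this sketch).
            nn.MaxPool3d(kernel_size=(1, 4, 4), stride=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(1, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.Conv3d(64, 64, kernel_size=(1, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            # 1x1x1 convolution to reduce parameters, then global pooling.
            nn.Conv3d(64, 64, kernel_size=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)   # classes: '0', '1', 'idle'

    def forward(self, videos):
        # Early temporal concatenation: N x [B, 3, 7, H, W] -> [B, 3, N*7, H, W]
        x = torch.cat(videos, dim=2)
        return self.classifier(self.features(x).flatten(1))

# Example: 3-input configuration with 240 x 240 elemental videos.
model = OneDInImCNN()
logits = model([torch.randn(2, 3, 7, 240, 240) for _ in range(3)])   # [2, 3]
```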

3. Experiment methods

3.1 Experiment setup and data collection

Figure 3 shows the experimental setup. We used a light-emitting diode (LED) operating at a wavelength of 640 nm to transmit the optical signal, which uses a 7-bit Gold code encoding scheme. To be specific, the symbol “1” is encoded as [1, 1, 0, 0, 1, 0, 1] and “0” is encoded as the flipped Gold code sequence [0, 0, 1, 1, 0, 1, 0]. A water tank of dimensions 500 mm (W) × 250 mm (L) × 250 mm (H) is placed in front of the LED to mimic the underwater environment. Controlled amounts of antacid are added to create different levels of turbidity. Beer's coefficient is adopted to quantify the turbidity of each underwater environment. From the Beer-Lambert law, $I = I_o e^{-\alpha z}$, where α is the attenuation coefficient, or Beer's coefficient, in units of mm⁻¹. The calculated Beer's coefficient α accounts for the total attenuation from both scattering and absorption in the water. $I_o$ is the initial intensity of the light source, and I is the intensity of the light after traveling a distance z in the underwater medium. The optical signal passing through the water tank, filled with clear or turbid water, is captured by our camera array. The camera array consists of 9 G-192 GigE cameras with C-mount zoom lenses. The maximum resolution recorded by the cameras is 1600(H) × 1200(V). The pitch between adjacent cameras is 80 mm. Compared to the previously adopted two-dimensional 3 × 3 camera array [23,24], we use a one-dimensional 1 × 9 camera array to obtain better longitudinal depth resolution [29].
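The encoding and the turbidity measurement described above can be summarized by the short sketch below; the helper names are ours, but the code sequences and the Beer-Lambert relation are those stated in the text.

```python
import numpy as np

GOLD_ONE  = np.array([1, 1, 0, 0, 1, 0, 1])   # symbol '1'
GOLD_ZERO = np.array([0, 0, 1, 1, 0, 1, 0])   # symbol '0' (flipped sequence)

def encode_symbols(bits):
    """Map a bit sequence to the LED on/off chip stream (one chip per camera
    frame, since the LED modulation is synchronized with the frame rate)."""
    return np.concatenate([GOLD_ONE if b else GOLD_ZERO for b in bits])

chips = encode_symbols([1, 0, 0, 1, 1, 0, 1, 0])   # 8 symbols -> 56 chips

def beer_coefficient(I, I0, z_mm):
    """Beer's coefficient alpha = -ln(I/I0)/z from measured intensities,
    with z the propagation distance in the water in mm."""
    return -np.log(I / I0) / z_mm
```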


Fig. 3. Setup for the underwater optical signal detection experiment with 1D Integral imaging convolutional neural network (1DInImCNN). The optical signal is captured by a 1D camera array. Turbid water in the tank is used to mimic the underwater environment.


To study the influence of changing resolution on the computational cost at the receiver's end for the various approaches (3D InIm with CNN-BiLSTM vs. the integrated pipeline 1DInImCNN), data collection has been done at three different resolutions: 240(H) × 240(V), 500(H) × 500(V), and 1600(H) × 1200(V). There are two advantages to using smaller resolutions for signal detection. Firstly, the processing time for the networks is lower, and secondly, the camera array can support a higher frame rate, resulting in a higher data rate for underwater optical signal detection. When the camera array records at 240(H) × 240(V) resolution, the maximum achievable frame rate with our cameras is 124 fps; thus, a frame rate of 120 fps is adopted for data collection. It should be noted that digital cameras with frame rates in the mega-frames-per-second range are available [18]. We synchronize the frequency of the LED on-off modulation with the frame rate of the cameras, which ensures that each video frame captures either one “on” or one “off” state. Synchronizing the frequency of the LED temporal modulation with the camera frame rate is necessary in image-based camera communication to recover the bits from the captured images [30]. When the camera array records at 500(H) × 500(V) and 1600(H) × 1200(V) resolutions, the maximum achievable frame rates are 64 and 23 fps, respectively, so frame rates of 60 and 20 fps, respectively, are adopted. Moreover, the number of videos used from different perspectives also influences the computational cost of both approaches. More groups of data can be generated for a specific resolution by selecting videos from subsets of cameras. Figure 3 shows the camera numbering. For example, in the case of the 3-input 1DInImCNN, we select videos from cameras 1, 5, and 9 for further data processing. In the case of the 5-input 1DInImCNN, recorded videos from cameras 1, 3, 5, 7, and 9 are used. In the case of the 7-input 1DInImCNN, recorded videos from cameras 1, 2, 3, 5, 7, 8, and 9 are used. For consistency, the conventional 3D InIm with CNN-BiLSTM approach uses the same choices of cameras for image reconstruction during processing. Therefore, at each resolution, we have a total of 4 distinct combinations of cameras. The same procedure is repeated for the three resolutions, namely 240(H) × 240(V), 500(H) × 500(V), and 1600(H) × 1200(V). Thus, we have 12 different configurations each for 1DInImCNN and 3D InIm with CNN-BiLSTM.

Figure 4(a) shows the setup used for training data collection in clear and turbid water without occlusion. During data collection, the light source is placed within the field of view of the cameras. A water tank without occlusion is placed in front of the LED to mimic the underwater environment. To increase the generalization capability of the networks and to enhance the diversity of the training data, we have recorded the data in both clear water and various levels of turbid water at each resolution. Table 1 shows the Beer's coefficients. At each turbidity level, the symbols sent by the LED are [1, 0, 0, 1, 1, 0, 1, 0], and this process is repeated four times to collect 32 symbols, with each symbol encoded by the 7-bit encoding scheme. Therefore, a total of 224 bits of optical signal are collected by the cameras. Under the same turbidity, the LED is shifted to different random locations during data collection to further increase the diversity of the training data. Moreover, to reduce experimental complexity, we apply augmentations to the training data, including random vertical reflection and X and Y translations.


Fig. 4. Experimental setup of the 1DInImCNN system in Fig. 3 using a water tank during data collection in the presence of turbidity and occlusion. (a): experimental condition during the collection of training data. (b): experimental condition during the collection of testing data with occlusion. (c-d): example images of training data taken from camera 5 at resolution 240(V) × 240(H) with α = 0.0095 and α = 0.0196, respectively. (e-f): example images of testing data taken from camera 5 at resolution 240(V) × 240(H) with α = 0.0064 and α = 0.0170, respectively.


Table 1. Turbidity levels (α in mm⁻¹) for training data at different resolutions

For a continuous stream of incoming video data, we need to slice the incoming data into 7-frame video sequences in order to feed it to a classifier. For slicing the incoming data, we utilized a sliding window approach. When using a sliding window to enumerate all possible 7-bit signal sequences, a maximum of 36 distinct sequences arise. Among these 36 combinations, there is 1 Gold code, 1 flipped Gold code, and 34 other possibilities. Therefore, we can divide the training data into three classes: class 1, class 0, and class idle. The idle class corresponds to the 34 possibilities that do not fall into either class ‘0’ or class ‘1’. For collecting training data, we transmitted 32 symbols of the signal, each symbol coded with the 7-bit Gold code encoding scheme. Applying the sliding window approach to the recorded data yields 16 videos corresponding to class ‘1’, 16 videos corresponding to class ‘0’, and 34 unique video sequences corresponding to class ‘idle’. To deal with the imbalanced dataset, random minority oversampling is utilized for class 0 and class 1 to equalize the number of videos per class. Thus, we get 34 × 3 = 102 videos for a specific turbidity level, resolution, and LED position. We have repeated the same procedure for 5 different turbidities and 2 different LED positions, making a total of 1020 videos.
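The sliding-window slicing and labeling can be sketched as follows; the helper is hypothetical but mirrors the procedure described above, labeling each 7-frame window as class '1', class '0', or 'idle'.

```python
GOLD_ONE  = (1, 1, 0, 0, 1, 0, 1)
GOLD_ZERO = (0, 0, 1, 1, 0, 1, 0)

def label_windows(chip_stream, frames):
    """Slide a 7-frame window over the per-frame on/off chip stream and the
    matching video frames, returning (7-frame clip, class label) pairs.
    Label 2 denotes the 'idle' class of all non-codeword sequences."""
    samples = []
    for i in range(len(chip_stream) - 6):
        window = tuple(chip_stream[i:i + 7])
        if window == GOLD_ONE:
            label = 1
        elif window == GOLD_ZERO:
            label = 0
        else:
            label = 2
        samples.append((frames[i:i + 7], label))
    return samples
```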

For collecting testing data, we followed a procedure similar to that for the training data, except that the turbidity levels are different, the light source is randomly placed, and partial occlusion is created by randomly placing plants inside the tank. Table 2 shows the Beer's coefficients for the testing data at the different resolutions. Figure 4(b) shows the setup used to collect the testing data. Specifically, 32 symbols of the signal are collected at each turbidity level and resolution. Using the sliding window approach to generate 7-frame videos, for each case we get 16 videos corresponding to class 0, 16 videos corresponding to class 1, and 136 class idle videos. Thus, in total, we have 168 testing videos in each case. We repeat the same procedure for 5 turbidity levels at 3 different resolutions. In [24], the authors applied sliding-window-based detection to the testing data to obtain classification scores; a summation of the score prominences based on the encoding scheme was then used to decode the signal and evaluate the detection performance. However, as an end-to-end integrated pipeline, 1DInImCNN does not require this additional step for signal detection.


Table 2. Turbidity levels (α in mm⁻¹) for testing data at different resolutions. α is Beer's coefficient.

In previous underwater signal detection work [24], conventional 3D InIm reconstruction was required. Thus, the accuracy of the estimated depth from the camera sensor to the object of interest influences the quality of the 3D reconstructed images. Images reconstructed at the correct depth significantly increase the visibility of the light source under partial occlusion and hence improve classification performance. If the depth information (source range) is known a priori [24], it can substantially simplify the computational process for long-range objects. However, in real-world scenarios it may not be realistic to assume that the source range is known a priori. Thus, for the performance evaluation of the previously reported underwater signal detection using conventional 3D InIm, we assume that the light source location is not known. For this case, we used a depth estimation algorithm that minimizes the statistical variance of the spectral radiation pattern (SRP) [31] to reconstruct the light source, along with the CNN-BiLSTM, to analyze signal detection performance. This prevents any prior information from influencing the detection performance and provides a more accurate representation of the total computational requirements when using the conventional 3D InIm approach with CNN-BiLSTM. Unlike the previously presented 3D InIm with CNN-BiLSTM approach, our proposed integrated pipeline 1DInImCNN does not require the source range to be known a priori, which results in substantially improved computational performance.
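A hedged sketch of such a variance-minimization depth search is shown below; it reuses the Eq. (1) shift geometry, evaluates the variance across perspectives over a region of interest at each candidate depth, and returns the depth of minimum variance. The details (wrap-around shifts, grayscale images, region-of-interest averaging) are simplifications and not the implementation of Ref. [31].

```python
import numpy as np

def estimate_depth_srp(elemental, pitch_mm, focal_mm, sensor_w_mm,
                       depths_mm, roi):
    """Variance-minimization depth search over candidate depths.
    `elemental`: (M, H, W) grayscale elemental images from the 1D array.
    `roi`: (y0, y1, x0, x1) region containing the light source."""
    M, H, W = elemental.shape
    Nx = W
    y0, y1, x0, x1 = roi
    best_depth, best_var = None, np.inf
    for z in depths_mm:
        shift = focal_mm * pitch_mm * Nx / (sensor_w_mm * z)  # pixels per camera
        stack = np.stack([np.roll(elemental[m], int(round(m * shift)), axis=1)
                          for m in range(M)])
        var = np.var(stack[:, y0:y1, x0:x1], axis=0).mean()
        if var < best_var:                 # in-focus depth gives minimum variance
            best_depth, best_var = z, var
    return best_depth

# Example search grid matching the lab setup: up to 8 m in 0.05 m steps.
depths = np.arange(50.0, 8000.0 + 50.0, 50.0)
```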

4. Results and discussion

4.1 Theoretical analysis of computational cost

For underwater signal detection, the computational cost at the receiver's end and the detection performance are evaluated. Firstly, floating point operations (FLOPs) are used to compare the computational cost of the two approaches: 1) conventional 3D InIm, which requires the 3D reconstruction of the light source, and 2) the integrated pipeline 1DInImCNN, which does not require 3D source reconstruction. FLOPs count how many computations a model performs and are widely used to measure neural network complexity [32,33]. Basic operations, such as addition, subtraction, multiplication, and division between two scalars, are regarded as 1 FLOP. Some other operations, such as the min and max operations between two scalars, can also be approximated as 1 FLOP. To calculate the FLOPs for the 1DInImCNN, we have derived the equations for 3D convolutional layers, ReLU activation layers, 3D max-pooling layers, and fully connected layers as:

$$FLOPs_{3D\,conv} = C_i \cdot K_w \cdot K_h \cdot K_d \cdot 2 \cdot C_o \cdot W \cdot H \cdot D \tag{2}$$
$$FLOPs_{3D\,maxpool} = H \cdot W \cdot D \cdot C_o \tag{3}$$
$$FLOPs_{fully\,connected} = 2 \cdot I \cdot O \tag{4}$$
$$FLOPs_{ReLU\,activation} = W \cdot H \cdot D \cdot C_o \tag{5}$$

In Eqs. (2)–(5), $K_w$, $K_h$, and $K_d$ are the filter width, height, and depth, respectively. $C_i$ and $C_o$ are the numbers of channels of the input and output vectors, respectively. W, H, and D are the width, height, and depth of the output vector. I and O are the sizes of the input vector and output vector, respectively. For 3D InIm with CNN-BiLSTM, we used a depth search range of 8 meters with a step interval of 0.05 meters for the 3D InIm depth estimation via the SRP-based depth estimation algorithm in every configuration. The choice of depth range and step interval mainly depends on the computational power of the processor, and it is also related to the field of view of the cameras and the physical dimensions of the optical transmitter. Our choices of depth range and step interval are based on the experimental setup. In a real underwater environment, achieving high accuracy requires a larger depth range and a smaller step interval than those used in the laboratory setup, which results in more FLOPs.
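Equations (2)–(5) translate directly into a few helper functions; the example layer sizes at the end are illustrative assumptions, not the exact output dimensions of the network in Appendix A.

```python
def flops_conv3d(c_in, c_out, kw, kh, kd, w_out, h_out, d_out):
    """FLOPs of a 3D convolutional layer, Eq. (2)."""
    return c_in * kw * kh * kd * 2 * c_out * w_out * h_out * d_out

def flops_maxpool3d(w_out, h_out, d_out, c_out):
    """FLOPs of a 3D max-pooling layer, Eq. (3)."""
    return w_out * h_out * d_out * c_out

def flops_fully_connected(n_in, n_out):
    """FLOPs of a fully connected layer, Eq. (4)."""
    return 2 * n_in * n_out

def flops_relu(w_out, h_out, d_out, c_out):
    """FLOPs of a ReLU activation layer, Eq. (5)."""
    return w_out * h_out * d_out * c_out

# Example: the first 1DInImCNN layer (32 kernels of 7x7x7 on a 3-channel input)
# with an assumed 47 x 47 x 3 output volume.
first_layer_flops = flops_conv3d(3, 32, 7, 7, 7, 47, 47, 3)
```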

Figure 5(a) depicts the FLOPs for 1DInImCNN and the conventional depth-estimated 3D InIm with CNN-BiLSTM. Each approach has 12 configurations, and 1DInImCNN outperforms the depth-estimated 3D InIm with CNN-BiLSTM in all of them. In Fig. 5(b), we plot the FLOPs ratio to show the speed difference in each configuration; the FLOPs ratio is calculated using Eq. (6). Under the same configuration, 1DInImCNN is approximately 35 times faster than the depth-estimated 3D InIm with CNN-BiLSTM. In Figs. 5(c) and 5(d), we show the FLOPs results using a depth search range of 100 meters with a step interval of 0.05 meters, for which 1DInImCNN is around 215 times faster than the conventional 3D InIm with CNN-BiLSTM. Therefore, for underwater communication in the field, where the light source is at long range, the 1DInImCNN will have much better computational performance than the conventional 3D InIm with CNN-BiLSTM, because far more computation is needed for the 3D reconstruction of the light source.

$$FLOPs\,Ratio = \frac{FLOPs_{\,depth\text{-}estimated\,3D\,InIm\,+\,CNN\text{-}BiLSTM}}{FLOPs_{\,1DInImCNN}} \tag{6}$$


Fig. 5. (a): Number of FLOPs for laboratory setups of depth-estimated 3D InIm with CNN-BiLSTM and 1DInImCNN in every configuration. The X-axis shows the configurations; for example, 1×5 1200 represents videos recorded at resolution 1200(V) × 1600(H) from cameras 1, 3, 5, 7, and 9. (b): FLOPs ratio between 1DInImCNN and depth-estimated conventional 3D InIm with CNN-BiLSTM in every configuration. (c-d): FLOPs results using a depth search range of 100 meters with a step interval of 0.05 meters. The 1DInImCNN is about 215 times faster. For underwater communication in the field, where the light source is at long range, the 1DInImCNN will have much better computational performance than the conventional 3D InIm with CNN-BiLSTM.


4.2 Experimental results of detection performance

In every configuration, we select the optimal hyperparameters using Bayesian optimization for both 1DInImCNN and 3D InIm CNN-BiLSTM. The objective of the optimization is to maximize each network's classification accuracy on the validation dataset, which has a size of 10% of the training dataset. The initial learning rate is optimized in a range between 0.001 and 0.01 on a log scale. The mini-batch size is optimized in a range between 2 and 128. The maximum number of training epochs is optimized in a range between 20 and 40. The learning rate drop factor is optimized in a range between 0.1 and 0.4, and the learning rate drop period is optimized in a range between 7 and 15. The optimized mini-batch size for 1DInImCNN across all configurations is 128. We used three computers to train all the networks; their configurations are: 1) Intel Xeon E5-2640 v4 CPU and Nvidia Quadro RTX 6000, 2) Intel i9-10940X CPU and Nvidia Quadro RTX 6000, and 3) Intel i9-10940X CPU and Nvidia Quadro RTX 6000.
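For illustration, the search space above can be expressed with a Bayesian-style hyperparameter optimizer such as Optuna; this is a stand-in sketch, not the toolchain used in the paper, and `train_and_validate` is a hypothetical placeholder for training a network and returning its validation accuracy.

```python
import optuna

def train_and_validate(params):
    # Placeholder: train a 1DInImCNN with `params` and return the validation
    # accuracy; a constant is returned here so the sketch runs stand-alone.
    return 0.0

def objective(trial):
    # Hyperparameter ranges mirroring those stated in the text.
    params = {
        "initial_lr":     trial.suggest_float("initial_lr", 1e-3, 1e-2, log=True),
        "batch_size":     trial.suggest_int("batch_size", 2, 128),
        "max_epochs":     trial.suggest_int("max_epochs", 20, 40),
        "lr_drop_factor": trial.suggest_float("lr_drop_factor", 0.1, 0.4),
        "lr_drop_period": trial.suggest_int("lr_drop_period", 7, 15),
    }
    return train_and_validate(params)      # validation accuracy to maximize

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
```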

The receiver operating characteristic (ROC), area under the curve (AUC), and confusion matrix are universal metrics for evaluating classification performance. However, comparing the classification performance of two approaches across 12 configurations requires an informative and concise metric that describes multi-class classification with imbalanced testing data. If we used the confusion matrix as the main metric, it would be difficult to clearly conclude which method is better across the many configurations of the two approaches. ROC and AUC can be used to evaluate multi-class classification performance if we use the one-vs-rest strategy. Nevertheless, the Matthews correlation coefficient (MCC), compared to ROC and AUC, is more suitable and descriptive in the case of an imbalanced testing dataset in a multi-class problem [34,35], and it is widely used in the literature [36–38]. The value of the MCC ranges from -1 to 1, with -1 representing the worst classification, 0 a random classification, and 1 a perfect classification. To give a sense of how good the classification is for given MCCs, we have included the confusion matrices, ROC, and AUC for one configuration (3 inputs at 240(V) × 240(H)) in Figs. 6–9. Figures 7 and 9 show the ROC and AUC for both approaches in this configuration across different turbidities, and Figs. 6 and 8 show the confusion matrices for both approaches. As can be seen from Figs. 6 and 8, classes 0 and 1 are misclassified as class idle at high turbidities. A possible reason is that the idle class includes all other 34 possible combinations of the signal sequence, and some of these sequences are very similar to the sequences in classes 1 and 0.
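For reference, the multi-class MCC can be computed directly from predicted and true labels, for example with scikit-learn; the labels below are made-up values that only mimic the imbalanced test split (16, 16, and 136 videos for classes '0', '1', and 'idle').

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

y_true = np.array([0] * 16 + [1] * 16 + [2] * 136)   # 2 denotes the 'idle' class
y_pred = y_true.copy()
y_pred[:4] = 2                 # a few class-'0' windows misclassified as 'idle'
print(matthews_corrcoef(y_true, y_pred))   # 1.0 would indicate perfect detection
```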


Fig. 6. Confusion matrices for the proposed 1DInImCNN using 3 inputs with the 240(V) × 240(H) configuration. a) Tested at turbidity α = 0.0002, MCC = 0.808. b) Tested at turbidity α = 0.0064, MCC = 1. c) Tested at turbidity α = 0.0129, MCC = 0.981. d) Tested at turbidity α = 0.0170, MCC = 0.768. e) Tested at turbidity α = 0.0290, MCC = 0.331.


Fig. 7. ROC and AUC for the proposed 1DInImCNN using 3 inputs with the 240(V) × 240(H) configuration. a) Tested at turbidity α = 0.0002, MCC = 0.808. b) Tested at turbidity α = 0.0064, MCC = 1. c) Tested at turbidity α = 0.0129, MCC = 0.981. d) Tested at turbidity α = 0.0170, MCC = 0.768. e) Tested at turbidity α = 0.0290, MCC = 0.331.


Fig. 8. Confusion matrices for 3D InIm CNN-BiLSTM using 3 inputs with the 240(V) × 240(H) configuration. a) Tested at turbidity α = 0.0002, MCC = 0.696. b) Tested at turbidity α = 0.0064, MCC = 0.903. c) Tested at turbidity α = 0.0129, MCC = 0.838. d) Tested at turbidity α = 0.0170, MCC = 0.267. e) Tested at turbidity α = 0.0290, MCC = -0.012.


Fig. 9. ROC and AUC for 3D InIm CNN-BiLSTM using 3 inputs with the 240(V) × 240(H) configuration. a) Tested at turbidity α = 0.0002, MCC = 0.696. b) Tested at turbidity α = 0.0064, MCC = 0.903. c) Tested at turbidity α = 0.0129, MCC = 0.838. d) Tested at turbidity α = 0.0170, MCC = 0.267. e) Tested at turbidity α = 0.0290, MCC = -0.012.


Classification of the test data generates five MCCs (one per turbidity level) for each configuration. A box plot is shown in Fig. 10 to summarize the performance over all turbidity levels across all configurations, and Table 3 compares the statistics of the two tested methods. Table 3 and Fig. 10 show that the median MCC for 1DInImCNN is much better than that of the depth-estimated 3D InIm with CNN-BiLSTM. From the 25th percentile to the median, 1DInImCNN maintains MCC values above 0.848, whereas the depth-estimated 3D InIm with CNN-BiLSTM has a 25th percentile of only 0.042, the level of a random classifier. The outliers in Fig. 10 come mainly from classification at high turbidity (α > 0.0254 mm⁻¹), where neither approach can successfully classify the signal under severe scattering and occlusion. Therefore, from the statistics of the MCCs, 1DInImCNN is more robust for optical signal detection than the depth-estimated 3D InIm with CNN-BiLSTM.


Fig. 10. A boxplot summarizing the Matthews correlation coefficients (MCCs) for both approaches. Each box contains 60 MCCs. The upper and lower blue bounds indicate the 75th and 25th percentiles of the MCCs, respectively. The red line in each box indicates the median MCC. Red plus signs indicate outliers. Important statistics from the boxplot are included in Table 3. Outliers are mainly from classification at high turbidity.


Table 3. Statistics from the Matthews correlation coefficient (MCC) boxplot for both approaches

In Fig. 11, we show average MCC values, calculated over all configurations, for the tested approaches as a function of turbidity. The average Beer's coefficient ᾱ is used to represent the turbidity level. At each turbidity level, the average MCC is calculated by averaging the MCCs from all configurations. Two additional approaches that use only 2D information, namely the single-camera 1DInImCNN and 2D imaging with CNN-BiLSTM, are added for comparison. The single-camera 1DInImCNN has a single video input from our center camera (camera 5); the rest of the architecture is the same as that of the 1DInImCNN. Similarly, conventional 2D imaging using the center camera with CNN-BiLSTM is included for comparison. Figure 11 shows that 1DInImCNN has much better average MCCs than the other approaches across the various turbidities. For ᾱ of 0.0174 and 0.0271, the average MCC of 1DInImCNN is significantly higher than that of the depth-estimated 3D InIm with CNN-BiLSTM. Another interesting observation from Fig. 11 is that the average MCC at ᾱ = 0.0004 (no turbidity) is lower than at ᾱ = 0.0067 (low turbidity) across all approaches. This can be explained by the fact that the occlusion blocks the cameras from capturing the light source, whereas low turbidity scatters the light in a way that allows the cameras to capture more of the light source. That is, with slight turbidity in the water, the occluded light source becomes more visible to the cameras due to the scattering effect. As the turbidity of the water increases further, the performance of the classifiers starts to degrade.


Fig. 11. Average MCCs for the proposed integrated pipeline 1DInImCNN, conventional 3D InIm with CNN-BiLSTM, single-camera 1DInImCNN, and 2D imaging with CNN-BiLSTM across different levels of turbidity in the presence of occlusion. The average MCC is calculated by averaging the MCCs from all configurations of a specific approach.


Based on the experimental results, we conclude that 1DInImCNN may outperform 3D InIm with CNN-BiLSTM in detection performance. We think there are two reasons for this. Firstly, the previously proposed algorithm (3D InIm CNN-BiLSTM) is not an end-to-end architecture. Conventional InIm requires camera calibration, and at the receiver's end, depth estimation, 3D InIm reconstruction, and classification must be performed in separate stages; the errors accumulated at each stage add up and influence the final detection results. Because the proposed architecture is end-to-end and is optimized directly to maximize performance given the input data, it may have a performance advantage over the previously proposed conventional InIm system. Secondly, the previous 3D InIm with CNN-BiLSTM method applies spatial feature extraction first, using the pre-trained GoogLeNet, and then uses BiLSTM cells to learn the temporal information; that is, spatial and temporal information are learned in a fixed order. In contrast, the proposed 1DInImCNN learns the spatial and temporal information simultaneously in one network, which may help increase optical signal detection performance.

5. Conclusion

In conclusion, we have presented underwater optical signal detection using an integrated end-to-end approach with 1D InIm signal capture and a convolutional neural network, which we denote as 1DInImCNN. The signal capture stage is implemented with a 1D array of cameras in an InIm-inspired setup to obtain the spatial and angular information of the rays. Unlike conventional 3D InIm, which requires computationally intensive 3D scene reconstruction [39] and camera calibration, the proposed 1DInImCNN uses an integrated pipeline that processes the captured 1D array of videos for signal detection with a CNN, without 3D scene reconstruction or camera calibration. We have compared our proposed 1DInImCNN and the previously reported 3D InIm with CNN-BiLSTM approach in terms of detection performance and computational cost at the receiver's end in challenging underwater conditions with occlusion and turbidity. The results show that 1DInImCNN may outperform the other tested methods in classification performance. The reason could be that, for conventional InIm, there may be sources of error due to camera calibration estimation, 3D scene reconstruction, and depth estimation; these sources of error are avoided by the proposed 1DInImCNN, as none of these estimations is required. Furthermore, the 1DInImCNN requires lower computational complexity and cost than 3D InIm with CNN. Future work could focus on developing CNNs for faster classification and on implementation with high-frame-rate cameras.

Appendix A: 1DInImCNN architecture Table

Table 4 presents the architecture and details of the 1DInImCNN.


Table 4. Architecture of 1DInImCNN

Funding

U.S. Department of Education (GAANN Fellowship); Air Force Office of Scientific Research (FA9550-21-1-0333); Office of Naval Research (N000141712405, N000142012690, N000142212349, N000142212375).

Acknowledgments

We wish to acknowledge support under The Office of Naval Research (ONR) (ONR N000142212349; N000141712405, N000142012690, N000142212375); Air-Force Office of Scientific Research (AFOSR) (FA9550-21-1-0333). T. O'Connor acknowledges support via the GAANN fellowship through the Department of Education. G. Krishnan acknowledges the support via GE fellowship.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time.

References

1. W. Liu, Z. Xu, and L. Yang, “SIMO detection schemes for underwater optical wireless communication under turbulence,” Photonics Res. 3(3), 48–53 (2015). [CrossRef]  

2. X. Liu, S. Yi, X. Zhou, Z. Fang, Z.-J. Qiu, L. Hu, C. Cong, L. Zheng, R. Liu, and P. Tian, “34.5 m underwater optical wireless communication with 2.70 Gbps data rate based on a green laser diode with NRZ-OOK modulation,” Opt. Express 25(22), 27937–27947 (2017). [CrossRef]  

3. P. Tian, X. Liu, S. Yi, Y. Huang, S. Zhang, X. Zhou, L. Hu, L. Zheng, and R. Liu, “High-speed underwater optical wireless communication using a blue GaN-based micro-LED,” Opt. Express 25(2), 1193–1201 (2017). [CrossRef]  

4. J. Wang, C. Lu, S. Li, and Z. Xu, “100 m/500 Mbps underwater optical wireless communication using an NRZ-OOK modulated 520 nm laser diode,” Opt. Express 27(9), 12171–12181 (2019). [CrossRef]  

5. A. Celik, N. Saeed, T. Y. Al-Naffouri, and M.-S. Alouini, “Modeling and performance analysis of multihop underwater optical wireless sensor networks,” in 2018 IEEE Wireless Communications and Networking Conference (WCNC) (IEEE, 2018), pp. 1–6.

6. Z. Vali, A. Gholami, Z. Ghassemlooy, M. Omoomi, and D. G. Michelson, “Experimental study of the turbulence effect on underwater optical wireless communications,” Appl. Opt. 57(28), 8314–8319 (2018). [CrossRef]  

7. C. Tu, W. Liu, W. Jiang, and Z. Xu, “First Demonstration of 1Gb/s PAM4 Signal Transmission Over A 130 m Underwater Optical Wireless Communication Channel with Digital Equalization,” in 2021 IEEE/CIC International Conference on Communications in China (ICCC) (IEEE, 2021), pp. 853–857.

8. H. Kaushal and G. Kaddoum, “Underwater Optical Wireless Communication,” IEEE Access 4, 1518–1547 (2016). [CrossRef]  

9. B. Tian, F. Zhang, and X. Tan, “Design and development of an LED-based optical communication system for autonomous underwater robots,” in 2013 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (IEEE, 2013), pp. 1558–1563.

10. K. Nakamura, I. Mizukoshi, and M. Hanawa, “Optical wireless transmission of 405 nm, 1.45 Gbit/s optical IM/DD-OFDM signals through a 4.8 m underwater channel,” Opt. Express 23(2), 1558–1566 (2015). [CrossRef]  

11. M. Doniec, I. Vasilescu, M. Chitre, C. Detweiler, M. Hoffmann-Kuhnt, and D. Rus, “AquaOptical: A lightweight device for high-rate long-range underwater point-to-point communication,” in OCEANS 2009 (2009), pp. 1–6.

12. M. Doniec, C. Detweiler, I. Vasilescu, and D. Rus, “Using optical communication for remote underwater robot operation,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (2010), pp. 4017–4022.

13. R. Joshi, G. Krishnan, T. O’Connor, and B. Javidi, “Signal detection in turbid water using temporally encoded polarimetric integral imaging,” Opt. Express 28(24), 36033 (2020). [CrossRef]  

14. M. Cho and B. Javidi, “Peplography—a passive 3D photon counting imaging through scattering media,” Opt. Lett. 41(22), 5401 (2016). [CrossRef]  

15. S. Komatsu, A. Markman, and B. Javidi, “Optical sensing and detection in turbid water using multidimensional integral imaging,” Opt. Lett. 43(14), 3261–3264 (2018). [CrossRef]  

16. S. Nishimoto, T. Nagura, T. Yamazato, T. Yendo, T. Fujii, H. Okada, and S. Arai, “Overlay coding for road-to-vehicle visible light communication using LED array and high-speed camera,” in 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC) (2011), pp. 1704–1709.

17. I. Takai, S. Ito, K. Yasutomi, K. Kagawa, M. Andoh, and S. Kawahito, “LED and CMOS Image Sensor Based Optical Wireless Communication System for Automotive Applications,” IEEE Photonics J. 5(5), 6801418 (2013). [CrossRef]  

18. P. Xia, Y. Awatsuji, K. Nishio, and O. Matoba, “One million fps digital holography,” Electron Lett. 50(23), 1693–1695 (2014). [CrossRef]  

19. A. Markman, X. Shen, and B. Javidi, “Three-dimensional object visualization and detection in low light illumination using integral imaging,” Opt. Lett. 42(16), 3068–3071 (2017). [CrossRef]  

20. A. Markman and B. Javidi, “Learning in the dark: 3D integral imaging object recognition in very low illumination conditions using convolutional neural networks,” OSA Continuum 1(2), 373–383 (2018). [CrossRef]  

21. G. Krishnan, Y. Huang, R. Joshi, T. O’Connor, and B. Javidi, “Spatio-temporal continuous gesture recognition under degraded environments: performance comparison between 3D integral imaging (InIm) and RGB-D sensors,” Opt. Express 29(19), 30937–30951 (2021). [CrossRef]  

22. B. Javidi, A. Carnicer, J. Arai, T. Fujii, H. Hua, H. Liao, M. Martínez-Corral, F. Pla, A. Stern, L. Waller, Q.-H. Wang, G. Wetzstein, M. Yamaguchi, and H. Yamamoto, “Roadmap on 3D integral imaging: sensing, processing, and display,” Opt. Express 28(22), 32266 (2020). [CrossRef]  

23. R. Joshi, T. O’Connor, X. Shen, M. Wardlaw, and B. Javidi, “Optical 4D signal detection in turbid water by multi-dimensional integral imaging using spatially distributed and temporally encoded multiple light sources,” Opt. Express 28(7), 10477–10490 (2020). [CrossRef]  

24. G. Krishnan, R. Joshi, T. O’Connor, and B. Javidi, “Optical signal detection in turbid water using multidimensional integral imaging with deep learning,” Opt. Express 29(22), 35691–35701 (2021). [CrossRef]  

25. G. Lippmann, “La photographie intégrale,” C. R. Séances Acad. Sci. 146, 446–451 (1908).

26. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 248–255.

27. S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei, eds., Proceedings of Machine Learning Research (PMLR, 2015), 37, pp. 448–456.

28. D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv, arXiv:1412.6980 (2014).

29. S. Kishk and B. Javidi, “Improved resolution 3D object sensing and recognition using time multiplexed computational integral imaging,” Opt. Express 11(26), 3528–3541 (2003). [CrossRef]  

30. W. Liu and Z. Xu, “Some practical constraints and solutions for optical camera communication,” Philos. Trans. R. Soc., A 378(2169), 20190191 (2020). [CrossRef]  

31. M. DaneshPanah and B. Javidi, “Profilometry and optical slicing by passive three-dimensional imaging,” Opt. Lett. 34(7), 1105–1107 (2009). [CrossRef]  

32. A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation,” arXiv, arXiv:1606.02147 (2016).

33. Á. Arcos-García, J. A. Álvarez-García, and L. M. Soria-Morillo, “Evaluation of deep neural networks for traffic sign detection systems,” Neurocomputing 316, 332–344 (2018). [CrossRef]  

34. D. Chicco and G. Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation,” BMC Genomics 21(1), 6 (2020). [CrossRef]  

35. G. Jurman, S. Riccadonna, and C. Furlanello, “A Comparison of MCC and CEN Error Measures in Multi-Class Prediction,” PLoS One 7(8), e41882 (2012). [CrossRef]  

36. H. Guo, X. Yang, R. Jing, Y. Li, F. Tan, and M. Li, “Robust multi-class model constructed for rapid quality control of Cordyceps sinensis,” Microchem. J. 171, 106825 (2021). [CrossRef]  

37. X. Zhang, M. Z. Akber, and W. Zheng, “Predicting the slump of industrially produced concrete using machine learning: A multiclass classification approach,” J. Build. Eng. 58, 104997 (2022). [CrossRef]  

38. H. Son, J. W. Lim, S. Park, B. Park, J. Han, H. B. Kim, M. C. Lee, K.-J. Jang, G. Kim, and J. H. Chung, “A Machine Learning Approach for the Classification of Falls and Activities of Daily Living in Agricultural Workers,” IEEE Access 10, 77418–77431 (2022). [CrossRef]  

39. M. Martínez-Corral and B. Javidi, “Fundamentals of 3D imaging and displays: a tutorial on integral imaging, light-field, and plenoptic systems,” Adv. Opt. Photonics 10(3), 512–566 (2018). [CrossRef]  
