
Target recognition and segmentation in turbid water using data from non-turbid conditions: a unified approach and experimental validation

Open Access

Abstract

Semantic segmentation of targets in underwater images captured in turbid water presents significant challenges, hindered by environmental variability, the difficulty of acquiring datasets, imprecise data annotation, and the poor robustness of conventional methods. This paper addresses these issues by proposing a novel joint deep learning method for semantic segmentation in turbid environments, motivated by the practical case of efficiently collecting polymetallic nodules in the deep sea while minimizing damage to the seabed environment. Our approach comprises a novel data expansion technique and a modified U-Net based model. Drawing on the underwater image formation model, we introduce noise into clear-water images to simulate images captured under varying degrees of turbidity, providing an alternative to data that are otherwise hard to acquire. Furthermore, because conventional U-Net based modifications show limited gains on such tasks, we propose a new model that incorporates an improved dual-channel encoder designed around the primary factors underlying image degradation. Our method significantly advances the fine segmentation of underwater images in turbid media, and experimental validation demonstrates its effectiveness and superiority under different turbidity conditions. The study provides new technical means for deep-sea resource development, holding broad application prospects and scientific value.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

As underwater exploration and resource development continue to advance and expand into broader and deeper regions, the demand for detecting specific underwater targets from complex aquatic environments is increasing. Numerous relevant applications with significant development prospects have emerged in fields including intelligent aquaculture [1,2] and environmental monitoring [3]. Among these, certain complex operational tasks necessitate more precise segmentation of objects and structures within acquired images from sensors, such as accurately grasping targets using robotic arms [4,5] or estimating the size of specific marine organisms [6].

Traditional computer vision methods encounter challenges in these automated tasks when confronted with complex, murky underwater scenes and targets [7]. Compared with analogous tasks on land, underwater optical imaging in natural aquatic environments suffers quality degradation from light absorption and scattering [8]. These effects manifest as color bias, reduced contrast, blurred target edges, diminished detail, and overall decreased image clarity, all of which hinder accurate pixel-level segmentation. Furthermore, high levels of dissolved substances and biogenic materials in the water exacerbate image degradation by increasing the light absorption and scattering coefficients to varying degrees. The resulting variable and severely compromised underwater visibility renders many conventional segmentation methods impractical and fragile, as they often fail to accurately identify and locate target edges [9].

Even with high-power artificial lighting sources assisting camera sensors in these murky waters, effectively enhancing image quality remains difficult. This difficulty arises because illumination at close range simultaneously enhances target information intensity and backscatter interference [10]. While certain image restoration algorithms can enhance image quality, they often require manual tuning and lack robustness across different environmental conditions [11]. Particularly in highly turbid waters, conventional methods struggle to adapt, leading to limited restoration effectiveness.

On the other hand, advancements in computational power and deep learning technology have propelled target detection and segmentation to the next phase. Through deep neural network structures, deep learning models can automatically learn high-level features and understand the complex characteristics and patterns of underwater targets from large datasets [12]. Compared to traditional methods, deep learning models are more adaptable to complex underwater targets and possess stronger generalization capabilities [13]. They can better cope with various challenges in underwater environments, including uneven illumination, water disturbances, and target occlusions. Consequently, the application of deep learning methods in underwater target detection and segmentation is yielding increasingly significant results, providing crucial support for the development of underwater exploration.

However, achieving high-precision target detection and segmentation based on deep learning methods relies on acquiring high-quality datasets [14]. These datasets typically require divers or underwater robots equipped with cameras and other sensors for on-site collection, incurring high costs. The limited availability of underwater datasets presents challenges in meeting engineering and research application demands. Moreover, the turbidity of water bodies fluctuates with changes in factors such as location, time and ocean currents, resulting in dataset instability and decreased segmentation precision [15]. Additionally, despite costly image data acquisition, precise annotation for image labels is challenging due to unclear target edges in images captured from murky water bodies. Therefore, there is currently limited research and engineering application of target detection and segmentation tasks in turbid water bodies, necessitating effective solutions to advance the application scope of underwater operational tasks and intelligent resource development.

In our study, we aim to devise a dependable method for accurately segmenting targets within images taken amidst varying levels of turbidity in water. Central to our investigation is to accomplish this task by training deep learning models on images of the targets obtained in clear water settings, ensuring separation from the test dataset. By utilizing clear water images as the source of the training dataset and applying the trained model to images captured under different levels of turbidity, we circumvent the challenges of obtaining specific turbidity datasets in practical applications and the difficulties in accurately annotating these datasets.

To better align with practical concerns and validate the efficacy of our proposed method, we focus on the task of segmenting target images in deep-sea mining operations. Polymetallic nodules, also known as manganese nodules, are distributed in the surface layer of the ocean floor at depths of 3500 to 6000 meters [16], as shown in Fig. 1(a). Characterized by their wide distribution and high metal content, these nodules are considered crucial mineral resources for the future. Hydraulic collection is one of the primary methods for gathering them: high-speed jets separate shallow-buried nodules from the seabed, and the nodules are then lifted to storage containers via pump suction, as illustrated in Fig. 1(b) and (c) [17]. Advanced adaptive collection technologies can autonomously adjust device parameters such as collection height and injection angle based on the distribution and size of the nodules [18]. Accurately segmenting ore from the seabed background in images can thus help mining equipment collect ore efficiently while reducing damage to the seabed environment, a research direction with significant practical value and environmental significance.

Fig. 1. Collection of polymetallic nodules in deep-sea mining. (a) Deep-sea mining vehicle and seabed environment; (b) Hydraulic nodule collector; (c) Schematic diagram of hydraulic nodule collection.

This paper presents a unified approach comprising a novel data expansion technique for simulating turbid-water images from clear-water images, alongside an enhanced U-Net based model named Uknot-Net. The approach is specifically tailored to semantic segmentation tasks in turbid water environments. Leveraging an underwater image formation model, we introduce specific levels of additive and multiplicative noise into clear-water images. This simulates images captured across different levels of turbidity, effectively expanding the dataset and substituting for hard-to-acquire images from turbid water. Addressing the primary factors contributing to image degradation, we propose a novel model that integrates an improved dual-channel encoder to better separate backscatter noise from the target signal. An experiment simulating polymetallic nodules distributed on the seabed, as in deep-sea mining tasks, is conducted in a water tank under controlled laboratory conditions. Training datasets are acquired exclusively in clear water, while test datasets encompass three levels of turbidity. Our method is compared with existing U-Net based models through quantitative and qualitative analyses. Experimental validation demonstrates significant improvements in the fine segmentation of underwater images in turbid media, highlighting the method's effectiveness and superiority across turbidity conditions. In summary, our study introduces novel technical approaches for deep-sea resource development, offering extensive application prospects and scientific significance.

The remainder of this paper is organized as follows. Section 2 explains the underwater image formation process, followed by the specific methodology and innovation of this study. Section 3 describes the experimental layout and the process of dataset acquisition and preparation. Section 4 discusses the semantic segmentation results. Section 5 presents the conclusions of this study.

2. Methodology

For deep learning tasks involving object recognition and segmentation in images, both dataset preparation and model architecture design play pivotal roles. Our joint method is divided into two parts in this section. Subsection 2.1 simulates images captured under various turbidity conditions by incorporating additive and multiplicative noise into clear-water images; this procedure is derived from an analysis of light propagation in water together with task-specific assumptions. The data expansion method circumvents the high cost of obtaining image data in turbid media and the difficulty of annotating targets in such images. In subsection 2.2, the U-Net architecture is enhanced to address the primary causes of underwater image degradation: branches are introduced at different hierarchical levels to form a knot-like structure, allowing the model to separate the negative (noise) component through these branches and thereby segment more accurately in highly turbid conditions.

2.1 Turbidity simulation for data expansion

When light travels through the water medium, the particles it encounters will absorb and scatter its photons. The likelihood of these events can be determined by the absorption coefficient ${\beta _a}$ and the scattering coefficient ${\beta _s}$ [19]. According to the Beer-Lambert law [20], the decrease in target intensity received by a camera sensor can be quantified by the total attenuation coefficient ${\beta _t}$, where ${\beta _t}$ is the sum of ${\beta _s}$ and ${\beta _a}$. The received degraded signal intensity at a certain wavelength can be expressed as:

$$D(\lambda, z) = L_{sc} \cdot e^{-\beta_t(\lambda) z},$$
where ${L_{sc}}$ is the radiance of the scene; z represents the distance between the scene surface and sensor, as illustrated in Fig. 2.

Fig. 2. Light propagation mechanism in the water medium. (a) Under ambient light condition; (b) Under artificial illumination condition.

Participating particles scatter ambient illumination, leading to an overall increase in pixel intensity across the image. Figure 2(a) illustrates how ambient light is scattered by a particle along the path of the target signal and then travels along the same path toward the sensor. This in-scattered light is additive noise, is the main cause of degradation, and is known as backscatter in received underwater images. The backscatter can be calculated using the following formula:

$$B(\lambda, z) = \int_0^z \beta_s(\lambda) L_a(\lambda, s)\, e^{-\beta_t(\lambda) s}\, ds = \frac{\beta_s(\lambda) L_a(\lambda, z)}{\beta_t(\lambda)} \left( 1 - e^{-\beta_t(\lambda) z} \right),$$
where $L_a$ is the radiance of the ambient light and s is the distance along the propagation path at which the scattering event occurs.
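To illustrate the interplay of Eqs. (1) and (2), the short numerical sketch below evaluates both terms over distance; the coefficient and radiance values are arbitrary placeholders, not measurements from this study.

```python
import numpy as np

beta_s, beta_a = 0.3, 0.1   # illustrative scattering / absorption coefficients [1/m]
beta_t = beta_s + beta_a    # total attenuation coefficient
L_sc, L_a = 1.0, 0.8        # scene and ambient radiance (arbitrary units)

z = np.linspace(0.1, 5.0, 50)   # sensor-to-scene distances [m]

D = L_sc * np.exp(-beta_t * z)                            # attenuated signal, Eq. (1)
B = (beta_s * L_a / beta_t) * (1 - np.exp(-beta_t * z))   # backscatter, Eq. (2)

# As z grows, D decays toward 0 while B saturates toward beta_s * L_a / beta_t,
# so backscatter progressively dominates the received image.
```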

The attenuated target signal and backscatter together constitute the image captured by the camera, contributing respectively the multiplicative and additive noise that degrade underwater images. In murky water or deep-sea environments, artificial lighting is required to provide image brightness, as shown in Fig. 2(b) [21]. For a point light source, the attenuation of the illumination during forward propagation must also be taken into account. The attenuated scene signal and backscatter then become:

$$D(\lambda, z) = \frac{I_0(\lambda)}{u^2} R \cdot e^{-\beta_t(\lambda)(u + z)},$$
$$B(\lambda, z) = \int_0^z \beta_s(\lambda) \frac{I_0(\lambda)}{t^2} e^{-\beta_t(\lambda)(t + s)}\, ds,$$
where $I_0$ is the radiant intensity of the point source; R is the surface reflectance; u and t are the distances indicated in Fig. 2(b).

Regardless of whether the scene is lit by ambient or artificial illumination, Eqs. (1)–(4) show that the target signal weakens with distance while backscatter intensifies. Our objective is not to precisely quantify these two noise components at specific turbidity levels, but rather to generate an expanded dataset that simulates various turbidity conditions from clear-water images. In deep-sea mining operations, polymetallic nodules are typically scattered on flat seabed surfaces, so the distance between targets and sensor varies smoothly. We additionally assume that, under a given turbidity condition, the attenuation of the target signal and the variation of backscatter are of similar magnitude across the spatial extent of an image. Under these simplifications, we can define a bounded parameter $\varphi$ to relate the variations of these two constituent parts of the image signal:

$$\Delta B(\lambda, z) = \frac{-\varphi}{H \times W} \sum_{p=0}^{H \times W} \Delta D(\lambda, z, p),$$
where p is the pixel index and H and W are the image height and width. If we disregard the negligible backscatter in clear-water images, the simulated images in the expanded dataset can be expressed in terms of the clear-water images $I_{clear}$ as:
$$I_{simulated}(\lambda, z) = \omega I_{clear}(\lambda, z) + \varphi (1 - \omega) \overline{I_{clear}}(\lambda, z),$$
where $\omega$ is a parameter indicating the simulated turbidity level within the range of 0 to 1. A smaller value corresponds to higher turbidity, which correlates with stronger attenuation of the target signal; consequently, the backscatter also increases. We use $\varphi(1 - \omega)$ to adjust the extent of this increase, accommodating variations in $\beta_s$ and $\beta_a$ under different lighting conditions and water bodies due to differences in the types and proportions of dissolved substances.
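To make the expansion concrete, the following minimal Python sketch applies Eq. (6) to a clear-water image. Interpreting $\overline{I_{clear}}$ as the per-channel image mean and assuming a [0, 1] pixel range are our choices for illustration; the function and variable names are not from the authors' code.

```python
import numpy as np

def simulate_turbidity(img_clear: np.ndarray, omega: float, phi: float) -> np.ndarray:
    """img_clear: HxWxC float image in [0, 1]; omega in (0, 1]; phi bounded (e.g. <= 2)."""
    mean = img_clear.mean(axis=(0, 1), keepdims=True)   # per-channel mean, the I_bar term
    simulated = omega * img_clear + phi * (1.0 - omega) * mean  # Eq. (6)
    return np.clip(simulated, 0.0, 1.0)                 # guard against pixel saturation

# Example: a strongly turbid sample (small omega -> heavy attenuation, more backscatter)
rng = np.random.default_rng(0)
clear = rng.random((256, 256, 3)).astype(np.float32)
turbid = simulate_turbidity(clear, omega=0.4, phi=1.5)
```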

2.2 Enhanced U-net based architecture

Semantic segmentation is a crucial computer vision task that assigns a category label to each pixel in an image to achieve a detailed understanding of the scene. The U-Net model is particularly well suited to semantic segmentation in medical imaging, such as cell and organ segmentation, and performs well on small datasets compared with other model types [22]. Similarly, in deep-sea mining, polymetallic nodules are typically scattered on the seabed surface and exhibit distribution characteristics similar to those in medical imaging: simple and fixed structures, low semantic information, and relatively small-scale datasets [23,24].

The typical model consists of a series of convolutional and pooling layers designed in an encoder-decoder architecture, where the contracting path captures contextual information, and the expansive path refines localization. A key feature of U-Net is its utilization of skip connections, which help the model combine shallow and deep features through concatenation and preserve more detailed information during the decoding phase [25].

Following the classic U-Net model, improved variants continue to be proposed. U-Net++ is an extension that introduces multiple encoding and decoding paths and enhances the skip connections with nested skip pathways, allowing better integration of multi-scale features [26]. Some enhanced models incorporate residual connections inspired by ResNet architectures to facilitate gradient flow during training and alleviate the vanishing-gradient problem, enabling more effective learning [27]. Others integrate attention mechanisms to selectively emphasize informative regions and suppress irrelevant ones during feature extraction, improving segmentation accuracy [28,29]. Transformer-based U-Nets are a newer trend: they combine the Transformer's self-attention mechanism with the U-Net architecture to leverage both the Transformer's ability to capture long-range dependencies and U-Net's strong feature extraction capabilities [30,31]. Among these, ACC-UNet combines Transformer design decisions with a fully convolutional structure, achieving higher accuracy across multiple datasets [32].

However, for semantic segmentation in turbid media, where turbidity fluctuates and the environment is complex, improved U-Net based models may fail to deliver superior performance. Since we introduce noise into clear-water images during training as described by Eq. (6), the simulated images also contain backscatter as additive noise. Backscatter is the main contributor to image degradation compared with signal attenuation [33], yet it remains difficult for deep neural networks to address effectively.

In our proposed Uknot network, we introduce a novel dual-path encoding mechanism to enhance the U-Net architecture, as illustrated in Fig. 3. Departing from the conventional U-Net structure, our approach splits the encoding process into two parallel paths, each dedicated to extracting distinct information from the input image: one path focuses on capturing intrinsic features of the targets, while the other emphasizes discerning additive noise components. We differentiate the two by initially designating one path as the primary pathway for analyzing target features, through pre-training or transfer learning. This dual-path encoding strategy empowers the model to discriminate between target intensity and backscatter noise, elevating segmentation accuracy.

Fig. 3. Uknot-Net architecture.

After obtaining feature maps from each path, we perform element-wise subtraction between corresponding features at the same hierarchical level. This yields refined representations that accentuate target pixel regions while attenuating noise. Because the subtraction leaves the dimensions of each hierarchical feature unaltered, the encoder output undergoes the usual up-sampling and concatenation operations unchanged. This strategy ensures strong noise suppression and enhances the recognition of target pixel regions in the final segmentation output.
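As an illustration of this mechanism, below is a minimal PyTorch sketch of a single dual-path encoder stage. The two-convolution stage design, channel widths, and pooling choice are our assumptions, not the exact Uknot-Net configuration in Fig. 3.

```python
import torch
import torch.nn as nn

def conv_stage(c_in: int, c_out: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class DualPathEncoderStage(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.target_path = conv_stage(c_in, c_out)  # intrinsic target features
        self.noise_path = conv_stage(c_in, c_out)   # additive (backscatter) noise
        self.pool = nn.MaxPool2d(2)

    def forward(self, x_t, x_n):
        f_t, f_n = self.target_path(x_t), self.noise_path(x_n)
        refined = f_t - f_n   # element-wise subtraction; shape unchanged,
                              # so skip connections work as in plain U-Net
        return refined, self.pool(f_t), self.pool(f_n)

# refined feeds the skip connection at this level; the pooled features
# feed the next, deeper stage of each path.
stage = DualPathEncoderStage(3, 64)
x = torch.randn(1, 3, 256, 256)
skip, x_t, x_n = stage(x, x)   # both paths start from the same input image
```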

3. Experiment and dataset

3.1 Equipment setup

The camera on a deep-sea mining vehicle is typically installed close to the artificial illumination and aimed downward to photograph the polymetallic nodules on the seabed floor from above, as demonstrated in Fig. 4.

Fig. 4. Examples of polymetallic nodules scattered on the seabed floor.

To simulate this environment in our experiment, we set up a vertically downward-facing camera over a water tank to capture the arranged scene, as shown in Fig. 5(a). The camera is a FLIR BFS-U3-51S5PC-C industrial camera with a resolution of 2440 × 2048 pixels. Its Sony IMX250 MYR sensor, depicted in Fig. 5(b), has a 2/3″ format with a pixel size of 3.45 µm × 3.45 µm. An FA0810A lens with an 8 mm focal length and a C-mount interface is mounted on the camera, and an LED light source is installed next to the camera to provide artificial illumination.

Fig. 5. Experiment setup. (a) Imaging arrangement in water tank; (b) Camera and its sensor; (c) Pebbles used as targets.

We employed a distinctive variety of pebbles containing iron to simulate polymetallic nodules scattered on the seabed floor, as shown in Fig. 5(c). These stones, ranging from 1.5 cm to 4 cm in size, exhibit a grey hue when dry and turn black when submerged in water. A layer of sand was laid on the bottom of the tank to simulate the seabed environment. We categorized the pebble sizes into 1.5–2.5 cm and 2.5–4.0 cm ranges and ensured the dataset encompassed large-sized, small-sized, and mixed-sized configurations. In each arranged scenario, varying numbers and sizes of pebbles were randomly scattered on the bottom to give their distribution a degree of randomness and diversity.

3.2 Datasets acquisition

The main compartment of the water tank measures 48 cm × 48 cm in length and width, with an adjoining 29 cm × 16 cm area designated for water changes without disturbing the bottom sand layer. In each experiment we added approximately 110 L of tap water, raising the water level to 40 cm. A total of 64 images of different target distributions were collected under clear-water conditions and used as the training and validation sets, as presented in Fig. 6(a).

Fig. 6. Our obtained dataset. (a) Images captured in clear water for training and validation; (b) Images captured in clear water, low-turbidity water, and high-turbidity water for testing.

In addition to these clear-water images, we obtained 10 further image groups as test sets under the three conditions in Fig. 6(b). After setting up each scene and adding clear water, we first captured images of the clear water using the same method. Then, 50 mL of skimmed milk was added to the tank twice in increments, simulating low- and high-turbidity conditions. Skimmed milk, a complex biological nutrient solvent containing casein, protein, fat, and lactose, effectively imitates the scattering and absorption properties of turbid water and is often used in underwater image recovery experiments [34,35]. After each addition, a sufficient settling time was allowed to ensure thorough dispersion before capturing images. The water was then drained and the residual solution flushed out of the tank to prepare for the next image group.

3.3 Preprocessing

In deep learning, data augmentation enhances dataset diversity and size by applying varied transformations to existing samples [36]. Particularly crucial for small datasets, it boosts model generalization by exposing the model to varied patterns of size, shape, color tone, and contrast. In this study we employed random horizontal flipping, resizing with aspect-ratio adjustment, and HSV color-space conversion followed by random linear transformations of hue, saturation, and value. Parameterized within the range 0 to 2, these transformations enriched the dataset, fostering model robustness and improving generalization.

We also applied our proposed data expansion method under three parameter settings for comparative analysis based on Eq. (6). Condition 0: training was conducted solely on original images without simulated turbidity. Condition 1: the parameter ω in Eq. (6), representing the turbidity level, was randomly sampled from five values ranging from 0 to 1 at intervals of 0.2, while φ was not considered and set to 1. Condition 2: the parameter φ was additionally sampled randomly from values between 0.25 and 2, representing fluctuation of the backscattering coefficient; an upper limit of 2 was selected to prevent pixel-value saturation.
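The sampling logic for the three conditions can be sketched as follows; reading the five turbidity levels as {0.2, 0.4, 0.6, 0.8, 1.0} (so that ω = 0 never erases the image entirely) is our assumption, as is the helper's name.

```python
import random

def sample_expansion_params(condition: int) -> tuple:
    """Return (omega, phi) for one training sample under the given condition."""
    if condition == 0:                   # original images only
        return 1.0, 1.0                  # omega = 1 leaves the image unchanged
    omega = random.choice([0.2, 0.4, 0.6, 0.8, 1.0])   # five turbidity levels
    if condition == 1:
        return omega, 1.0                # phi fixed to 1
    phi = random.uniform(0.25, 2.0)      # Con. 2: fluctuating backscatter scale
    return omega, phi
```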

These strategies were implemented to investigate their impact on segmentation performance in turbid water. We partitioned the dataset in Fig. 6(a) into training and validation sets, allocating 25% of the data for validation and 75% for training. Given the limited dataset size, we replicated the training and validation sets five times each to mitigate the back-propagation fluctuations caused by turbidity differences during evaluation, thereby using prior knowledge to expand the data accordingly. The images in Fig. 6(b) were divided into three test sets according to turbidity.

Our training strategy is as follows. Images obtained from the experiments are resized to 256 × 256 as input to reduce computational cost. A combination of Dice loss and focal loss is employed as the loss function: Dice loss is commonly used in image segmentation because it is more sensitive to small targets, while focal loss helps mitigate the class imbalance introduced by data expansion. We use the Adam optimizer with an initial learning rate of 1e-4. VGG16 serves as the backbone feature extraction network of the Uknot-Net model for loading pre-trained weights. The batch size is set to 1, and the freeze epoch is set to 50 to employ a transfer learning strategy.
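For illustration, a minimal PyTorch sketch of one possible combined Dice + focal loss for binary segmentation is given below; the equal weighting of the two terms and the focusing parameter γ = 2 are our assumptions, not values reported in this paper.

```python
import torch
import torch.nn.functional as F

def dice_focal_loss(logits, targets, gamma: float = 2.0, eps: float = 1e-6):
    """logits, targets: float tensors of shape (N, 1, H, W), targets in {0, 1}."""
    probs = torch.sigmoid(logits)
    # Dice term: overlap-based, more sensitive to small targets
    inter = (probs * targets).sum(dim=(2, 3))
    denom = probs.sum(dim=(2, 3)) + targets.sum(dim=(2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    # Focal term: down-weights easy pixels to counter class imbalance
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    focal = ((1.0 - p_t) ** gamma * bce).mean(dim=(2, 3))
    return (dice + focal).mean()
```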

For the training dataset, we used LabelMe to annotate the targets in the images and converted the annotations into pseudo-color labeled images to obtain the label dataset [37]. For the test dataset, targets were annotated in the images taken in clear water, and the resulting labels were subsequently reused for the images captured under the turbid conditions within the same image groups.

4. Results and discussions

In this section, we present the results of our proposed data expansion method and the enhanced dual-channel encoder U-Net model, Uknot-Net, and evaluate its performance on a test dataset spanning three turbidity levels. VGG16 and ResNet50 serve as backbone feature extraction networks in the compared U-Net models for loading pre-trained weights. Our results are compared with several U-Net architectures: the VGG16-based U-Net, the ResNet50-based U-Net, the VGG16-based U-Net++ without pruning, the ResNet50-based U-Net improved with GCT channel attention, and the self-attention-based ACC-UNet.

In deep-sea mining operations, targets typically occupy a small proportion of the image compared with the seabed floor. Since minimizing both misidentification and omission rates is of paramount importance, the Intersection over Union (IoU) metric is chosen to gauge the congruence between model predictions and ground-truth labels [38]. IoU, also known as the Jaccard index, is widely employed in target recognition and is defined as the ratio of the intersection area to the union area of the predicted result and the ground truth:

$$IoU = \frac{TP}{TP + FP + FN},$$
where TP denotes positive samples predicted correctly, FP denotes negative samples incorrectly predicted as positive, and FN denotes positive samples incorrectly predicted as negative. IoU ranges from 0 to 1; a higher value denotes stronger overlap between model predictions and ground-truth labels.
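For reference, a minimal NumPy sketch of Eq. (7) computed from binary prediction and ground-truth masks:

```python
import numpy as np

def iou_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """pred, gt: arrays of 0/1 values with identical shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # intersection = TP
    union = np.logical_or(pred, gt).sum()     # union = TP + FP + FN
    return float(tp / (union + eps))
```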

4.1 Quantitative analysis

In our experiments, expanding the data by simulating turbidity at different degrees yields a notable enhancement in model performance, as shown in Table 1. With our data expansion schemes, model accuracy improves significantly; under high turbidity in particular, IoU scores generally increase by a factor of 2 to 3 on average. On the high-turbidity test set, the IoU score of our Uknot-Net model increases from 40.58% to 89.09% with Con. 1 and 92.06% with Con. 2. This demonstrates that our data expansion method substantially improves the robustness and generalization ability of the model.

We also evaluated several typical U-Net architectures using our expansion schemes under different turbidity conditions. The results demonstrate that our Uknot model exhibits superior recognition under high turbidity compared with the other models, showcasing better robustness across conditions. Using Con. 2, our Uknot-Net model achieved IoU scores of 95.74%, 93.61%, and 92.06% under clear-water, low-turbidity, and high-turbidity conditions, respectively.

While our model's IoU scores decreased slightly under high turbidity relative to low turbidity and clear water, the other improved U-Net models generally performed poorly under high-turbidity conditions. In particular, models using residual networks as backbones or incorporating channel or self-attention mechanisms showed low IoU scores, remaining ineffective even with our data expansion technique. This suggests that residual and attention mechanisms do not handle the noise introduced by turbid media, diminishing recognition of high-turbidity images. In contrast, the original U-Net, with its simple encoder-decoder architecture, demonstrated relatively satisfactory recognition when combined with our prior-knowledge-based data expansion method. Although U-Net++ approached the IoU scores of the basic U-Net, its dense connections did not improve recognition on the high-turbidity test set.

From Table 1, it is evident that our Uknot-Net model, an advanced version of the original U-Net incorporating a dual-channel encoding mechanism, effectively addresses this challenge and surpasses the other models considered in the experiment. The design preserves high recognition accuracy and improves on the conventional U-Net architecture under high turbidity. Furthermore, combined with the turbidity-simulating data expansion technique, our model achieves recognition capabilities comparable to those observed under clear-water and low-turbidity conditions. These findings underscore the efficacy and superiority of the Uknot-Net model for target recognition in turbid water environments.


Table 1. IoU scores comparison under different conditions

4.2 Qualitative analysis

In the quantitative analysis, enhanced versions of the U-Net architecture did not exhibit the improvements seen in conventional target segmentation tasks; instead, they performed poorly under highly turbid conditions. Hence, in this subsection we qualitatively analyze the highly turbid test images by visually comparing the segmentation results of the original U-Net model and our proposed Uknot-Net model under the data expansion schemes against the labeled ground truth.

It is evident from Fig. 7 that without simulated turbidity, both architectures struggled to identify targets in most test images, with the Uknot-Net model displaying slightly superior recognition. With our data expansion method, recognition improved significantly and was more robust across images. Using the additional scaling factor for backscatter further enhanced performance, particularly in identifying certain nodules within problem areas, yielding higher recognition accuracy. The Uknot-Net model also outperformed the U-Net model in recognizing targets across the vast majority of regions in different images.

Fig. 7. Segmentation results of the U-Net model and our proposed Uknot model using different data expansion schemes under high-turbidity conditions.

To provide a more comprehensive comparison, we evaluated the different models on two selected highly turbid test images, which respectively exhibit instances of missed detection and misidentification, offering meaningful visual comparisons. Apart from the Uknot-Net, U-Net, and U-Net++ models with VGG16 backbones, the other improved models performed poorly on both images, consistent with the IoU metrics discussed in the previous subsection. The models discussed below therefore refer to the VGG16-based results. In the first comparison in Fig. 8, the Uknot-Net model recognizes almost all targets, although only a portion of the target at the bottom of the image is detected; this may be because the pebble lies near the image edge. In contrast, the U-Net model fails to detect three targets, while the U-Net++ model only partially identifies two targets.

Fig. 8. Visual comparison of test images (9) and (5) under high turbidity, showing specific regions of the segmentation results from different models.

In the second comparison in Fig. 8, shadows caused by the direction of the light source easily lead to misidentifications. The U-Net model exhibits larger errors in identifying target areas, while the U-Net++ model performs relatively better but still produces several misidentifications. Our proposed Uknot-Net model, however, identifies all correct target areas and exhibits the fewest misidentified regions. Moreover, each misidentified region is very small and can be filtered out through post-processing to further improve accuracy.

Through qualitative and quantitative comparison, it is evident that our proposed data expansion method, which simulates turbidity based on prior knowledge of underwater optical imaging principles, yields a 2-3 times improvement in overall performance and region-level precision. Moreover, our Uknot model exhibits superior recognition in highly turbid conditions compared with other models, achieving fewer misidentifications and omissions in complex areas and maintaining performance in high turbidity comparable to clear-water and low-turbidity environments.

5. Conclusion

This study proposes a joint approach comprising a data expansion technique and an enhanced U-Net model to accomplish semantic segmentation tasks in turbid media. By using an underwater optical imaging model to simulate images at varying turbidity levels for data expansion, our approach enables deep learning models trained on clear-water datasets to tackle target identification under different turbid-water conditions. We also improve upon the widely used U-Net model by introducing the Uknot network structure, which incorporates a dual-channel encoder, to enhance segmentation accuracy in highly turbid environments. We conducted experiments simulating the distribution of polymetallic nodules in a water tank to validate the feasibility of our approach. Through comprehensive qualitative and quantitative analysis, our research achieved significant results: the data expansion method improved segmentation results by a factor of 2 to 3, and the Uknot-Net model effectively mitigates the additive noise caused by backscattering in turbid images through its additional channel encoding mechanism, improving segmentation accuracy and model robustness. Compared with other models, our method demonstrates better robustness and generalization capabilities, with superior performance in high-turbidity environments and fewer misidentifications and omissions.

For future work, we will continue to explore and optimize the model architecture to enhance segmentation in turbid media. We also plan to incorporate assistive techniques such as underwater active polarization methods to evaluate the effectiveness of our approach in highly turbid underwater environments. To assess robustness beyond deep-sea mining, we will acquire data and conduct tests in natural settings, expanding applicability to diverse and complex underwater scenarios requiring target detection and recognition.

Overall, our study provides an effective solution and practical value for the fine identification of targets in deep-sea mineral images in various turbid environments, thereby aiding in efficient collection for commercial mining tasks and marine environment conservation efforts. Given the analogous challenges encountered in target identification across various turbid media environments, our proposed joint approach demonstrates promising practical applicability and research significance.

Funding

Major Projects of Strategic Emerging Industries in Shanghai (BH3230001); Fundamental Research Funds for the Central Universities; Institute of Marine Equipment of Shanghai Jiao Tong University.

Acknowledgment

This work is supported by the Major Projects of Strategic Emerging Industries in Shanghai (BH3230001), Fundamental Research Funds for the Central Universities and Institute of Marine Equipment of Shanghai Jiao Tong University.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. J. G. A. Barbedo, “A review on the use of computer vision and artificial intelligence for fish recognition, monitoring, and management,” Fishes 7(6), 335 (2022). [CrossRef]  

2. X. C. Wang, Y. Wu, M. H. Xiao, et al., “Research progress of intelligent identification technology in aquaculture,” Journal of South China Agricultural University 44(1), 24–33 (2023). [CrossRef]  

3. K. Zhao, T. He, S. Wu, et al., “Application research of image recognition technology based on CNN in image location of environmental monitoring UAV,” J Image Video Proc. 2018(1), 150 (2018). [CrossRef]  

4. G. J. Sun and H. Y. Lin, “Robotic grasping using semantic segmentation and primitive geometric model based 3D pose estimation,” In 2020 IEEE/SICE International Symposium on System Integration (SII), pp. 337–342 (2020).

5. G. Du, K. Wang, S. Lian, et al., “Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review,” Artif. Intell. Rev. 54(3), 1677–1734 (2021). [CrossRef]  

6. G. Böer and H. Schramm, “Semantic Segmentation of Marine Species in an Unconstrained Underwater Environment,” In International Conference on Robotics, Computer Vision and Intelligent Systems, pp. 131–146 (Springer International Publishing, 2020).

7. N. Gracias, R. Garcia, R. Campos, et al., “Application challenges of underwater vision,” Computer Vision in Vehicle Technology: Land, Sea & Air 133–160 (2017).

8. Y. Shen, C. Zhao, Y. Liu, et al., “Underwater optical imaging: Key technologies and applications review,” IEEE Access 9, 85500–85514 (2021). [CrossRef]  

9. X. Yuan, L. Guo, C. Luo, et al., “A survey of target detection and recognition methods in underwater turbid areas,” Appl. Sci. 12(10), 4898 (2022). [CrossRef]  

10. L. Liu, X. Li, J. Yang, et al., “Fast image visibility enhancement based on active polarization and color constancy for operation in turbid water,” Opt. Express 31(6), 10159–10175 (2023). [CrossRef]  

11. Y. Wang, W. Song, G. Fortino, et al., “An experimental-based review of image enhancement and image restoration methods for underwater imaging,” IEEE Access 7, 140233–140251 (2019). [CrossRef]  

12. N. Wang, Y. Wang, and M. J. Er, “Review on deep learning techniques for marine object recognition: Architectures and algorithms,” Control Engineering Practice 118, 104458 (2022). [CrossRef]  

13. G. Hu, K. Wang, Y. Peng, et al., “Deep learning methods for underwater target feature extraction and recognition,” Computational intelligence and neuroscience 2018, 1–10 (2018). [CrossRef]  

14. Y. Mo, Y. Wu, X. Yang, et al., “Review the state-of-the-art technologies of semantic segmentation based on deep learning,” Neurocomputing 493, 626–646 (2022). [CrossRef]  

15. P. Tarling, M. Cantor, A. Clapés, et al., “Deep learning with self-supervision and uncertainty regularization to count fish in underwater images,” PLoS One 17(5), e0267759 (2022). [CrossRef]  

16. T. Kuhn, A. Wegorzewski, C. Rühlemann, et al., “Composition, formation, and occurrence of polymetallic nodules,” Deep-sea mining: Resource potential, technical and environmental considerations 7, 23–63 (2017). [CrossRef]  

17. K. Amudha, S. K. Bhattacharya, R. Sharma, et al., “Influence of flow area zone and vertical lift motion of polymetallic nodules in hydraulic collecting,” Ocean Eng. 294, 116745 (2024). [CrossRef]  

18. G. Zhao, L. Xiao, Z. Yue, et al., “Performance characteristics of nodule pick-up device based on spiral flow principle for deep-sea hydraulic collection,” Ocean Eng. 226, 108818 (2021). [CrossRef]  

19. D. Berman, T. Treibitz, and S. Avidan, “Single image dehazing using haze-lines,” IEEE Trans. Pattern Anal. Mach. Intell. 42(3), 720–734 (2020). [CrossRef]  

20. M. Pelka, M. Mackenberg, C. Funda, et al., “Optical underwater distance estimation,” In OCEANS 2017-Aberdeen, pp. 1–6 (IEEE, June 2017).

21. T. Treibitz and Y. Y. Schechner, “Active polarization descattering,” IEEE Trans. Pattern Anal. Mach. Intell. 31(3), 385–399 (2009). [CrossRef]  

22. M. K. Kar, M. K. Nath, and D. R. Neog, “A review on progress in semantic image segmentation and its application to medical images,” SN Comput. Sci. 2(5), 397 (2021). [CrossRef]  

23. W. Song, N. Zheng, X. Liu, et al., “An improved u-net convolutional networks for seabed mineral image segmentation,” IEEE Access 7, 82744–82752 (2019). [CrossRef]  

24. H. Wang, L. Dong, W. Song, et al., “Improved U-Net-Based Novel Segmentation Algorithm for Underwater Mineral Image,” Intelligent Automation & Soft Computing 32(3), 1573–1586 (2022). [CrossRef]  

25. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pp. 234–241 (Springer International Publishing, 2015).

26. Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, et al., “Unet++: A nested u-net architecture for medical image segmentation,” In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Proceedings 4, pp. 3–11 (Springer International Publishing, 2018).

27. H. Zhang, X. Hong, S. Zhou, et al., “Infrared image segmentation for photovoltaic panels based on Res-UNet,” In Chinese conference on pattern recognition and computer vision (PRCV), pp. 611–622 (Springer International Publishing, October 2019).

28. O. Oktay, J. Schlemper, L. L. Folgoc, et al., “Attention u-net: Learning where to look for the pancreas,” arXiv, arXiv preprint arXiv:1804.03999 (2018). [CrossRef]  

29. D. Maji, P. Sigedar, and M. Singh, “Attention Res-UNet with Guided Decoder for semantic segmentation of brain tumors,” Biomedical Signal Processing and Control 71, 103077 (2022). [CrossRef]  

30. O. Petit, N. Thome, C. Rambour, et al., “U-net transformer: Self and cross attention for medical image segmentation,” In Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings 12, pp. 267–276 (Springer International Publishing, 2021).

31. A. Lin, B. Chen, J. Xu, et al., “Ds-transunet: Dual swin transformer u-net for medical image segmentation,” IEEE Trans. Instrum. Meas. 71, 1–15 (2022). [CrossRef]  

32. N. Ibtehaz and D. Kihara, “Acc-unet: A completely convolutional unet model for the 2020s,” In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 692–702 (Springer Nature Switzerland, October 2023).

33. D. Akkaynak and T. Treibitz, “Sea-thru: A method for removing water from underwater images,” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1682–1691 (2019).

34. P. Han, F. Liu, K. Yang, et al., “Active underwater descattering and image recovery,” Appl. Opt. 56(23), 6631–6638 (2017). [CrossRef]  

35. R. Zhang, X. Gui, H. Cheng, et al., “Underwater image recovery utilizing polarimetric imaging based on neural networks,” Appl. Opt. 60(27), 8419–8425 (2021). [CrossRef]  

36. C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” J. Big Data. 6(1), 60 (2019). [CrossRef]  

37. B. C. Russell, A. Torralba, K. P. Murphy, et al., “LabelMe: a database and web-based tool for image annotation,” Int. J. Comput. Vis. 77(1-3), 157–173 (2008). [CrossRef]  

38. H. Choi, H. J. Lee, H. J. You, et al., “Comparative analysis of generalized intersection over union,” Sens. Mater. 31(11), 3849–3858 (2019). [CrossRef]  


