
Multi-signal feature fusion method with an attention mechanism for the Φ-OTDR event recognition system

Open Access

Abstract

Different signal representations exhibit different unique features for classification. In this paper, a feature fusion method with an attention mechanism based on multiple signal representations is proposed for Φ-OTDR event classification with buried optical fiber. Each signal representation is fused after feature extraction to obtain richer and better features. With the help of a layer pruning method based on the attention mechanism, the network size can be kept constant, avoiding an increase in computation. Experimental results show that this method with three signal representations improves the recognition accuracy to 97.93%, a 3.52% improvement compared with the best single-representation approach. It also shows higher recognition accuracy than traditional methods that fuse multiple signal representations at the input stage. Furthermore, when the method is used to fuse four representations, the recognition accuracy can be further improved to 99.11%.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Phase sensitive optical time domain reflectometer (Φ-OTDR) was proposed in 1993 [1]. The interference effect of backward Rayleigh scattering light enables it to detect weaker disturbances than traditional OTDR. S. V. Shatalin et al. established the classical Φ-OTDR theoretical model and proposed that the system can be used for distributed temperature and disturbance monitoring [2]. Φ-OTDR is immune to electromagnetic interference, resistant to corrosion and highly sensitive, and has been widely used in many field applications, such as safety monitoring for pipelines, submarine power cables and railway transportation [3–6]. The false alarm rate of the Φ-OTDR sensing system has always been one of the current challenges [7]. With the development of artificial intelligence, more and more machine learning and deep learning algorithms are being applied to event recognition for optical fiber sensors. Cao et al. applied the support vector machine (SVM) to vibration classification in the Φ-OTDR sensing system [8]. Wu et al. used a three-layer multi-layer perceptron (MLP) to classify three different event signals [9]. Shi et al. proposed a convolutional neural network (CNN) with a transfer learning method to identify spatial-temporal data of vibration signals with fast training speed [10,11]. Jiang et al. used Mel-frequency cepstral coefficients (MFCC) as the input of a CNN for event recognition [12]. Shi et al. proposed a dual-augmentation method to enhance the recognition process with few training samples [13]. Many Φ-OTDR event identification methods use only one representation of the signal, such as the time domain signal, frequency domain signal, spatial-temporal signal or MFCC signal, to recognize the event type. The expressive ability of a single signal representation is limited, which causes a bottleneck for the recognition model.

Recently, some multi-signal methods have been proposed for more accurate event recognition. J. Wang et al. used both time-domain and frequency-domain signals to classify events with the random forest (RF) algorithm [14]. C. Xu et al. used the short-time energy ratio, short-time level crossing rate, vibration duration and power spectrum energy ratio with an SVM to identify events [15]. However, those methods mainly rely on traditional machine learning and fuse the multiple signal representations before the model input. This superposition of different representations weakens their unique features and limits the recognition accuracy. Besides, the unique features appear at different spatial positions in different signal representations, so it is better to use different networks to extract them. In this paper, we propose a multi-signal feature fusion method with an attention mechanism, which fuses more features in the feature domain, for Φ-OTDR event recognition. Each signal representation is fused after feature extraction to obtain richer features. With the help of a layer pruning method based on the attention mechanism, the network size can be kept constant, avoiding an increase in computation and a significant reduction in recognition accuracy. Experimental results show that this method improves the classification accuracy compared with both the single-representation approach and the traditional fusion of multiple signal representations at the input.

2. Methodology

2.1 Φ-OTDR system and data acquisition

For completeness, the home-made Φ-OTDR system is shown in Fig. 1. It has a conventional and relatively inexpensive structure. The light comes from an ultra-narrow-linewidth laser (NLL) with 3 kHz linewidth and is then modulated into pulses by an acousto-optic modulator (AOM) with a 200 MHz frequency shift. An erbium-doped fiber amplifier (EDFA) is applied to compensate the insertion loss. The probe pulses are injected into the sensing fiber through a circulator, and the Rayleigh backscattered light (RBS) is received by a photodetector (PD). The output of the PD is acquired by a data acquisition card (DAC) and then processed by a personal computer (PC). The sensing fiber is a kilometer-long G652 single-mode fiber with steel wire reinforcement and polyvinyl chloride cladding, buried 5 cm below the ground surface. Eight types of events, namely sunny background (No. I), rainy background (No. II), walking (No. III), jumping (No. IV), water washing (No. V), beating with a shovel (No. VI), digging with a shovel (No. VII) and riding a bicycle (No. VIII), are applied to the sensing fiber at five different locations, on different days and under different weather conditions (rainy and sunny). Each event type is described in Table 1. It should be mentioned that some of these categories have very similar characteristics, such as No. I and No. II, or No. VII and No. VIII, which tests the ability of the event recognition method. The pulse repetition rate is set to 20 kHz. Two pulse widths, 100 ns and 200 ns, are applied, and the data collected under both pulse widths for the same event type are treated as samples of that event class. The sampling rate of the DAC is 50 MHz. The RBS within 1 second at the vibration location is extracted as one data sample, and the direct current component of the sample is removed.
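As a rough illustration of this sampling step, the following sketch (not the authors' code; the trace layout, variable names and helper function are assumptions) extracts one data sample from a matrix of RBS traces:

```python
# Minimal sketch of the data-sample extraction described above (assumptions:
# traces are stored as one row per probe pulse, columns are range bins).
import numpy as np

PULSE_RATE_HZ = 20_000          # pulse repetition rate (from the text)

def extract_sample(traces: np.ndarray, vib_idx: int) -> np.ndarray:
    """traces: (n_pulses, n_range_bins) RBS matrix. Returns 1 s of the
    time series at the vibration location with the DC component removed."""
    time_series = traces[:PULSE_RATE_HZ, vib_idx]   # 20,000 points = 1 second
    return time_series - time_series.mean()          # remove the direct current
```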

Fig. 1. Set up of Φ-OTDR system.

Table 1. The description of event classes

2.2 Structure of the recognition method

In this section, a multi-signal feature fusion method with an attention mechanism is proposed for Φ-OTDR event recognition. The method includes five main steps: signal transformation, feature extraction, feature fusion, attention mechanism pruning and classification, as shown in Fig. 2. The event signal is first transformed into different representations through different signal transformation methods, such as Fourier transformation, MFCC transformation and temporal domain pick-up. Different signal representations supply the unique features of an event from different perspectives. However, if these unique features are superimposed at the input step, they may be weakened by overlap. Thus, each signal representation passes through its own feature extraction layer and is fused in the feature domain. In this step, the dimension of the signal features increases greatly and supplies more unique information, but the network size and the computation also grow with the larger number of features. The attention mechanism pruning layer picks up the most important features and prunes the unimportant network branches to avoid the computation increase and make the network converge faster. Then the classification layer decides the event type based on the fused feature. It should be mentioned that the feature extraction layer and classification layer can be realized with different lightweight CNN models.

Fig. 2. Architecture of the proposed method.

2.3 Multi-signal feature fusion method

The multi-signal feature fusion method proposed in this section fuses different signal representations in the feature domain, instead of directly stacking multiple signal representations. Assume ${X^i} \in {{\mathbb R}^{c \times h \times w}}$ is a representation of an event signal, where $c$ is the channel size, $h$ and $w$ are the image size, and $i \in [1,2, \ldots ,m]$ is the index of the representation, such as the frequency domain representation, time domain representation, MFCC representation, etc. Then the multi-signal features after the feature extraction layer can be expressed as,

$${\overline X ^i} = F({X^i}),\;\;{\overline X ^i} \in {{\mathbb R}^{c^{\prime} \times h^{\prime} \times w^{\prime}}}$$
where $F({\cdot} )$ denotes the feature extraction layer and the channel size is increased to $c^{\prime}$. Then, the extracted features are fused in the feature fusion layer,
$$Y = {\overline X ^1} \oplus {\overline X ^2} \oplus \cdots \oplus {\overline X ^m},\;\;Y \in {{\mathbb R}^{c^{\prime\prime} \times h^{\prime} \times w^{\prime}}}$$
where ⊕ denotes channel superposition and the channel size increases to $c^{\prime\prime} = mc^{\prime}$. The fused features have a large channel dimension, which increases the computational complexity. Besides, some of the features are important and some are not. Thus, a pruning method based on the attention mechanism is given next.
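For illustration, a minimal PyTorch sketch of Eqs. (1) and (2) is given below; the AlexNet backbone follows Section 3.1, while the class name, batch size and (random) weights are illustrative assumptions:

```python
# Sketch of Eqs. (1)-(2): one feature-extraction branch F(.) per signal
# representation, fused by channel concatenation (not the authors' code).
import torch
import torch.nn as nn
from torchvision.models import alexnet

class MultiSignalFusion(nn.Module):
    def __init__(self, m: int):
        super().__init__()
        # Eq. (1): an independent convolutional extractor per representation
        self.extractors = nn.ModuleList(
            alexnet(weights=None).features for _ in range(m)
        )

    def forward(self, xs: list[torch.Tensor]) -> torch.Tensor:
        feats = [f(x) for f, x in zip(self.extractors, xs)]  # each (B, c', h', w')
        return torch.cat(feats, dim=1)                       # Eq. (2): c'' = m * c'

# Three 299x299 RGB representations (time, frequency, MFCC) for a batch of 2
xs = [torch.randn(2, 3, 299, 299) for _ in range(3)]
y = MultiSignalFusion(m=3)(xs)   # (2, 768, 8, 8) with the AlexNet backbone
```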

2.4 Attention mechanism pruning method

In order to solve the problem above and make the method suitable for embedded devices, a pruning layer based on the attention mechanism is applied between the feature fusion layer and the final classification layer. It picks up the important features and prunes the unimportant network branches to reduce the network size and computation amount.

The attention mechanism pruning layer consists of a global average pooling layer, two fully connected layers with a ReLU activation between them, and a Sigmoid function [16]. The fused feature is first compressed by global average pooling,

$$Z = \textrm{G}(Y),\;\;Z \in {{\mathbb R}^{c^{\prime\prime}}}$$
where $\textrm{G}({\cdot} )$ represents the channel-wise average pooling function, which can be expressed as,
$$\textrm{G}(Y) = \frac{1}{{h^{\prime} \times w^{\prime}}}\sum\nolimits_{i = 1}^{h^{\prime}} {\sum\nolimits_{j = 1}^{w^{\prime}} {{Y_{c^{\prime\prime}}}(i,j)} }$$

After the global average pooling, the feature dimension $[h^{\prime} \times w^{\prime}]$ is compressed to $[1 \times 1]$. The result then goes through two fully connected layers to generate the attention weight, which encodes individual channel information. The specific implementation can be expressed as,

$${W_A} = \sigma ({W_2}(\delta ({W_1}(Z)))),\;\; {W_A} \in {{\mathbb R}^{c^{\prime\prime}}}$$
where ${W_1}$ and ${W_2}$ refer to the two fully connected layers, $\delta$ is the ReLU function and $\sigma$ is the Sigmoid function. Based on the attention mechanism, a weight value is assigned to each channel of the fused feature. Subsequently, the latent channels are sorted from largest to smallest according to the generated attention weights ${W_A}$. The pruning process is defined as,
$$\overline Y = P(Y,{W_A},n),\;\; \overline Y \in {{\mathbb R}^{nc^{\prime\prime} \times h^{\prime} \times w^{\prime}}}$$
where the pruning function $P({\cdot} )$ extracts the top $nc^{\prime\prime}$ channels with the largest attention weights from $Y$. The parameter $n$ is a manually set threshold in $[0,1]$. This operation keeps the important feature components and reduces the network width and computation amount, which makes the proposed method deployable on embedded devices. The fused and pruned feature $\overline Y$ is then input into the final classification layer.
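A corresponding sketch of the pruning layer of Eqs. (3)–(6) is shown below; the squeeze-and-excitation structure follows [16], while the reduction ratio r and the per-batch channel ranking are illustrative assumptions (in deployment, the selected channels and their feeding branches would be fixed once after training):

```python
# Sketch of Eqs. (3)-(6): SE-style attention weights followed by keeping
# the top n*c'' channels (not the authors' code).
import torch
import torch.nn as nn

class AttentionPrune(nn.Module):
    def __init__(self, channels: int, n: float = 0.33, r: int = 16):
        super().__init__()
        self.keep = int(n * channels)         # top n*c'' channels survive
        self.fc = nn.Sequential(              # W1 -> ReLU -> W2 -> Sigmoid, Eq. (5)
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        z = y.mean(dim=(2, 3))                # Eqs. (3)-(4): global average pooling
        w_a = self.fc(z)                      # Eq. (5): attention weights W_A
        # Eq. (6): sort channels by attention weight and keep the largest ones
        idx = w_a.mean(dim=0).argsort(descending=True)[: self.keep]
        return y[:, idx]                      # (B, n*c'', h', w')

pruned = AttentionPrune(channels=768)(y)      # y from the fusion sketch above
```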

3. Experiments and results

3.1 Data set preparation

In this section, the time domain representation, the frequency domain representation and the MFCC representation are used as three signal representations to test the proposed method. The MFCC transformation is a temporal-frequency analysis that pays more attention to the lower frequency part [17]. As soil shows low-pass filter characteristics, MFCC is suitable for underground vibration signal analysis [18]. For completeness, the MFCC process is described here: (I) Pre-emphasize the time domain signal, then decompose it into short frames and multiply each frame by a Hamming window. (II) Calculate the discrete Fourier transform (DFT) of each frame. (III) Pass the DFT spectrum through a Mel filter bank to obtain the Mel cepstrum. In this experiment, AlexNet [19], a well-known lightweight CNN structure, is chosen as the infrastructure for the feature extraction layer and classification layer. To fit the input size of AlexNet, each signal representation is plotted and saved as a 299 × 299 RGB image. Typical time domain, frequency domain and MFCC representations are shown in Fig. 3. All vibration data samples are divided into training, validation and test sets in a ratio of approximately 7:1.5:1.5. The details of the data sets are shown in Table 2. There are more than 1000 samples in most categories and more than 10000 samples in total, which makes this data set large enough for network training.
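As an illustration of steps (I)–(III) and the image preparation, the sketch below uses librosa and matplotlib; the number of coefficients and plotting details are assumptions, and only the 299 × 299 RGB target size comes from the text:

```python
# Sketch of the MFCC representation pipeline (not the authors' code).
import librosa
import matplotlib.pyplot as plt
import numpy as np

def save_mfcc_image(signal: np.ndarray, sr: int, path: str) -> None:
    emphasized = librosa.effects.preemphasis(signal)   # step (I): pre-emphasis
    # steps (I)-(III): framing with a Hamming window, per-frame DFT, Mel filtering
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=20, window="hamming")
    fig = plt.figure(figsize=(2.99, 2.99), dpi=100)    # 299 x 299 pixel image
    ax = fig.add_axes([0, 0, 1, 1])                    # full-bleed, no margins
    ax.set_axis_off()
    ax.imshow(mfcc, aspect="auto", origin="lower")
    fig.savefig(path, dpi=100)
    plt.close(fig)
```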

Fig. 3. Typical time domain, frequency domain and MFCC representation.

Table 2. The number of each type of event

3.2 Comparative experiments

In order to verify the effectiveness of our method, a classification comparison with the single-representation approaches is carried out. A single-representation approach uses only one signal representation as the input: the frequency domain representation (Frequency Only), the time domain representation (Time Only) or the MFCC representation (MFCC Only). The comparison is repeated five times by randomly composing the training, validation and test sets (Groups I to V). The results are shown in Fig. 4. As the multiple signal representations offer more unique features for classification, the feature fusion method achieves better classification accuracy (97.93%) than the single-representation approaches, a 3.52% improvement over the MFCC Only approach (the best single-representation approach).

Fig. 4. Comparison of recognition accuracy between one representation approach and the proposed method.

Then a comparison between different fusion methods is conducted. Two other common fusion approaches, shown in Fig. 5, are used for the comparison. Method I: the three representations of an event signal are fused in the RGB channels with the same superimposition ratio at the input step and then sent into AlexNet for classification. Method II: the three representations of an event signal are stacked on the channel dimension and then sent into AlexNet for classification. This comparison is also repeated five times using the five randomly composed data set groups from Fig. 4. The result is shown in Fig. 6. Compared with these two fusion methods, our proposed method achieves better classification results, owing to a more efficient usage of features. In Method I, the features of the different signal representations are superimposed on each other, which may weaken them. In Method II, the feature overlap is avoided, but the number of features after the feature extraction layer is relatively small compared with our proposed method. This is why Method II performs better than Method I but worse than our method.
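For clarity, the two baselines can be sketched as follows; since the exact superimposition of Method I is not detailed, equal-weight averaging of the three RGB images is an assumption:

```python
# Sketch of the two baseline input-stage fusion methods (not the authors' code).
import torch
import torch.nn as nn
from torchvision.models import alexnet

def method_i(reps: list[torch.Tensor]) -> torch.Tensor:
    """Method I: superimpose the RGB images with equal ratio -> 3-channel input."""
    return sum(reps) / len(reps)          # (B, 3, 299, 299)

def method_ii(reps: list[torch.Tensor]) -> torch.Tensor:
    """Method II: stack on the channel dimension -> 9-channel input."""
    return torch.cat(reps, dim=1)         # (B, 9, 299, 299)

# Method II requires widening the first convolution to accept 9 channels:
net = alexnet(weights=None, num_classes=8)
net.features[0] = nn.Conv2d(9, 64, kernel_size=11, stride=4, padding=2)
```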

Fig. 5. The process of different multi-signal fusion methods. (a) Method I. (b) Method II.

Fig. 6. Comparison of recognition accuracy between different fusion methods.

3.3 Experiment on the influence of each signal representation

In order to explore the influence of each signal representation, experiments with only two signal representations are carried out. The results are shown in Fig. 7. It can be found that when the MFCC representation is not used, the accuracy of the model on the training set is lower than in the other groups of experiments. This verifies that the MFCC representation contributes the most discriminative information in the underground fiber optic sensing application.

Fig. 7. Comparison of classification accuracy on training set between different signal representations.

3.4 Experiment on the setting of the model threshold n

In order to avoid the growth of network width and computation caused by feature fusion, an attention-mechanism-based method is introduced for network pruning. To explore the best threshold n in Eq. (6), the classification accuracy during training under different thresholds n is obtained and shown in Fig. 8. A smaller n means less computation. From Fig. 8, it can be found that when n = 0.33, which means the network is cropped to 1/3, the accuracy declines only 0.34% compared with the unpruned network. This degree of classification accuracy loss is acceptable, so n = 0.33 is preliminarily selected. Then the network size and the classification performance on the test set are compared under different signal representations with the same baseline network. The details and results are shown in Table 3. From Table 3, when n = 0.33, both the classification accuracy loss and the network size increase can be avoided. The confusion matrix of our method with n = 0.33 is shown in Fig. 9.
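The trade-off behind this choice can be made concrete with a small sweep; the candidate thresholds and the 768 fused channels (three AlexNet branches of 256 channels each) are assumptions tied to the sketches above:

```python
# Width kept by the pruning layer of Eq. (6) for several thresholds n.
C_FUSED = 768                        # c'' = 3 x 256 for three AlexNet branches

for n in (1.0, 0.5, 0.33, 0.25):
    kept = int(n * C_FUSED)          # channels surviving the pruning
    print(f"n = {n:<4}: keep {kept}/{C_FUSED} channels "
          f"({100 * kept / C_FUSED:.0f}% of the fused width)")
```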

Fig. 8. Classification accuracy on the training set with different threshold n value.

Fig. 9. Confusion matrix of the proposed method with n = 0.33.

Table 3. Performance comparison of each method

4. Discussion

4.1 Signal representation diversity

In order to show the applicability of our method when more representations of a signal are available, one more representation, the spatial-temporal representation, is added. The spatial-temporal matrix is the RBS within 1 s and within a 50 m range on each side of the vibration location [11]. The horizontal direction of the data matrix denotes the spatial domain and the vertical direction denotes the temporal domain. Typical spatial-temporal signals are shown in Fig. 10. The results of the comparison between feature fusion with different signal representations are shown in Table 4. The comparison shows that fusing more signal features helps improve the classification accuracy, and it proves that the proposed method can be applied to more representations for classification. The confusion matrix of the four-representation fusion test is shown in Fig. 11.
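A sketch of assembling this representation is given below; the ~2 m range-bin spacing implied by the 50 MHz sampling rate and the trace layout are assumptions:

```python
# Sketch of the spatial-temporal matrix construction (not the authors' code).
import numpy as np

PULSE_RATE_HZ = 20_000          # pulse repetition rate (Section 2.1)
BIN_M = 2.0                     # assumed ~2 m per range bin at 50 MS/s

def spatial_temporal(traces: np.ndarray, vib_idx: int) -> np.ndarray:
    """traces: (n_pulses, n_range_bins) RBS matrix. Returns 1 s of traces over
    +/- 50 m around the event: rows = temporal domain, columns = spatial domain."""
    half = int(50 / BIN_M)      # 50 m on each side of the vibration location
    return traces[:PULSE_RATE_HZ, vib_idx - half : vib_idx + half + 1]
```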

Fig. 10. Typical spatial-temporal samples of different events.

Fig. 11. Confusion matrix of our method with four signal representations.

Table 4. Comparison of different multi-signal representation fusions

4.2 Discussion on lightweight network diversity

In embedded devices, lightweight CNNs are preferred due to storage limitations. Examples of lightweight CNNs are AlexNet, SqueezeNet [20] and EfficientNet [21]. To test the applicability of our method, these three networks are applied for feature extraction and classification. The three-representation method, using the time domain, frequency domain and MFCC representations, and the MFCC-Only method are carried out for comparison. The results are shown in Table 5. They show that the proposed method applies well to different lightweight networks.
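A sketch of how the backbone can be swapped is shown below; all three models ship with torchvision, and using their convolutional `features` part as the extractor is an illustrative assumption:

```python
# Sketch of backbone swapping for the feature extraction layer (not the
# authors' code).
from torchvision.models import alexnet, efficientnet_b0, squeezenet1_0

BACKBONES = {
    "AlexNet": alexnet,
    "SqueezeNet": squeezenet1_0,
    "EfficientNet": efficientnet_b0,
}

def make_extractor(name: str):
    """Return the convolutional part of the chosen lightweight CNN."""
    return BACKBONES[name](weights=None).features

extractor = make_extractor("SqueezeNet")
```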

Table 5. Application of different lightweight network models

5. Conclusion

In this paper, a multi-signal feature fusion method with an attention mechanism for Φ-OTDR event recognition is proposed. The proposed method with three signal representations achieves an average recognition accuracy of 97.93%, an improvement of approximately 3.52% compared with using the MFCC representation alone. Compared with other multi-signal fusion methods, the classification accuracy is also improved. In addition, the proposed method with four signal representations further achieves a recognition accuracy of 99.11%. With the help of the attention mechanism pruning layer, the whole network with the feature fusion method keeps a similar number of parameters as the original network and avoids significant degradation in recognition accuracy. We also verify that this method can be applied well to other lightweight CNNs.

Funding

National Natural Science Foundation of China (61801283); Basic and Applied Basic Research Foundation of Guangdong Province (2021A1515012001).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data will be made available upon reasonable request.

References

1. H. F. Taylor and C. E. Lee, “Apparatus and method for fiber optic intrusion sensing,” U.S. Patent 5,194,847 (16 March 1993).

2. S. V. Shatalin, V. N. Treschikov, and A. J. Rogers, “Interferometric optical time-domain reflectometry for distributed optical fiber sensing,” in SPIE's International Symposium on Optical Science, Engineering, and Instrumentation (International Society for Optics and Photonics, 1998), pp. 181–191.

3. A. Lv and J. Li, “On-line monitoring system of 35 kV 3-core submarine power cable based on φ-OTDR,” Sens. Actuators, A 273(15), 134–139 (2018). [CrossRef]  

4. M. Filograno, C. Riziotis, and M. Kandyla, “A Low-Cost Phase-OTDR System for Structural Health Monitoring: Design and Instrumentation,” Instruments 3(3), 46 (2019). [CrossRef]  

5. G. Ma, C. Shi, W. Qin, Y. Li, H. Zhou, and C. Li, “A Non-Intrusive Electrical Discharge Localization Method for Gas Insulated Line Based on Phase-Sensitive OTDR and Michelson Interferometer,” IEEE Trans. Power Delivery 34(4), 1324–1331 (2019). [CrossRef]  

6. M. He, L. Feng, and J. Fan, “A method for real-time monitoring of running trains using Ф-OTDR and the improved Canny,” Optik 184, 356–363 (2019). [CrossRef]  

7. F. Peng, N. Duan, Y. Rao, and J. Li, “Real-Time Position and Speed Monitoring of Trains Using Phase-Sensitive OTDR,” IEEE Photonics Technol. Lett. 26(20), 2055–2057 (2014). [CrossRef]  

8. C. Cao, X. Fan, Q. Liu, and Z. He, “Practical Pattern Recognition System for Distributed Optical Fiber Intrusion Monitoring Based on Ф-COTDR,” ZTE Communications 15(3), ASu2A.145 (2017). [CrossRef]  

9. H. Wu, X. Li, H. Li, W. Yu, G. Yuan, and Y. Rao, “An effective signal separation and extraction method using multi-scale wavelet decomposition for phase-sensitive OTDR system,” in Sixth International Symposium on Precision Mechanical Measurements, Proc. SPIE 8916, 89160Z (2013).

10. Y. Shi, Y. Wang, L. Zhao, and Z. Fan, “An Event Recognition Method for Φ-OTDR Sensing System Based on Deep Learning,” Sensors 19(15), 3421 (2019). [CrossRef]  

11. Y. Shi, S. Dai, T. Jiang, and Z. Fan, “A Recognition Method for Multi-Radial-Distance Event of Φ-OTDR System Based on CNN,” IEEE Access 9, 143473–143480 (2021). [CrossRef]  

12. F. Jiang, H. Li, Z. Zhang, and X. Zhang, “An event recognition method for fiber distributed acoustic sensing systems based on the combination of MFCC and CNN,” in 2017 International Conference on Optical Instruments and Technology: Advanced Optical Sensors and Applications (2018), pp. 15–21.

13. Y. Shi, S. Dai, X. Liu, Y. Zhang, X. Wu, and T. Jiang, “Event recognition method based on dual-augmentation for an Φ-OTDR system with a few training samples,” Opt. Express 30(17), 31232–31243 (2022). [CrossRef]  

14. J. Wang, Y. Hu, and Y. Shao, “The digging signal identification by the random forest algorithm in the phase-otdr technology,” IOP Conf. Ser.: Mater. Sci. Eng. 394(3), 032005 (2018). [CrossRef]  

15. C. Xu, J. Guan, M. Bao, J. Lu, and W. Ye, “Pattern recognition based on enhanced multifeature parameters for vibration events in Φ-OTDR distributed optical fiber sensing system,” Microw. Opt. Technol. Lett. 59(12), 3134–3141 (2017). [CrossRef]  

16. J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-Excitation Networks,” IEEE Trans. Pattern Anal. Mach. Intell. 42(8), 2011–2023 (2020). [CrossRef]  

17. S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Process. 28(4), 357–366 (1980). [CrossRef]

18. Y. Shi, Y. Li, Y. Zhang, Z. Zhuang, and T. Jiang, “An Easy Access Method for Event Recognition of Φ-OTDR Sensing System Based on Transfer Learning,” J. Lightwave Technol. 39(13), 4548–4555 (2021). [CrossRef]  

19. A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems (2012), pp. 1097–1105.

20. F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size,” arXiv:1602.07360 (2016).

21. M. Tan and Q. V. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” arXiv:1905.11946 (2019).
