
High-generalization deep sparse pattern reconstruction: feature extraction of speckles using self-attention armed convolutional neural networks

Open Access

Abstract

Light scattering is a pervasive problem in many areas. Recently, deep learning was implemented in speckle reconstruction. To better investigate the key feature extraction and generalization abilities of the networks for sparse pattern reconstruction, we develop the “one-to-all” self-attention armed convolutional neural network (SACNN). It can extract the local and global speckle properties of different types of sparse patterns, unseen glass diffusers, and untrained detection positions. We quantitatively analyzed the performance and generalization ability of the SACNN using scientific indicators and found that, compared with convolutional neural networks, the Pearson correlation coefficient, structural similarity measure, and Jaccard index for the validation datasets increased by more than 10% when the SACNN was used. Moreover, the SACNN is capable of reconstructing features 75 times beyond the memory effect range for a 120-grit diffuser. Our work paves the way to boosting the field of view and depth of field for various sparse patterns behind complex scatterers, especially in deep tissue imaging.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Imaging through scattering media is a classical inverse problem, especially in computational optics [1,2] and biomedical optics [3,4]. The main challenge in inverse problems is the forward operator, which includes the effects of the scattering medium. Several attempts have been made to characterize random scattering media efficiently. One method is the transmission matrix, which measures a forward operator that includes an accessible and static diffuser, i.e., the Green's function between the incident field and the detector [5–8]. As a model-based method, it suffers from a limited space-bandwidth product (SBP). The other method is to characterize statistical similarities through speckle correlation, known as the memory effect (ME) [9,10]. The ME defines an angular range within which the speckle pattern does not change but only translates over a distance. The generalized ME model provides a complete description of these combined shift/tilt correlations within the scattering media, and it extracts the high-order shift-invariant information of a dynamic diffuser. However, it is intolerant to speckle decorrelation and has a limited ME range [11–13]. In other words, the optimal reconstruction solution to the ill-posed inverse scattering problem remains challenging.

As a direct forward modeling method, deep learning (DL) was recently implemented in computational imaging (CI), and it has provided high-quality solutions to several CI problems. For imaging through diffusers, DL mainly investigates the properties of the scattering medium for speckle reconstruction. The U-net architecture IDiffNet, first proposed by S. Li et al., realized speckle image reconstruction [14]. Y. Li et al. demonstrated a network for scalable diffusers with different microstructures [15]. Lyu et al. presented a hybrid neural network model that can reconstruct sparse objects hidden behind a 3 mm thick white polystyrene slab [16]. A class-specific GAN reconstruction network was built by Sun et al. and used for dynamic scattering media [17]. Zheng et al. proposed a deep convolutional neural network for feature extraction of nonstatic and turbid media [18]. PDSNet, proposed by E. Guo et al., achieved reconstruction over a range 40 times the ME range, especially for sparse patterns [19].

A major limitation of existing DL approaches is their limited generalization ability. As convolutional and recurrent operations both process a local neighborhood, it is difficult to extract the global or long-range dependency of the speckle pattern, which leads to less effective feature extraction with limited generalization ability. The self-attention (SA) mechanism has achieved significant improvements in non-local modeling for video classification tasks, object detection and segmentation, pose estimation, etc. [20–22]. It can directly model the long-range dependencies in feature maps and capture the global correlation features and dependencies. Moreover, the SA armed convolutional neural network (CNN) can redistribute the weights of the key features in the former layer and pass them into the next layer of the network, thus strengthening and classifying the representation of the input speckle features.

Here, we develop the “one” model, which not only sufficiently encompasses the first- and second-order statistical properties of speckle features across “all” scalable diffusers and detecting positions considered in our experiments, but also generalizes to “all” the considered imaging conditions with various sparse object types, unseen diffusers, and untrained detecting positions. We quantitatively evaluate the “one-to-all” SACNN model with four scientific indicators, namely Pearson correlation coefficient (PCC), structural similarity measure (SSIM), Jaccard index (JI), and peak signal-to-noise ratio (PSNR).

In this study, we demonstrate the generalization performance of the SACNN in the context of sparse feature extraction. The model is trained and tested using MNIST handwritten digits [23]. The trained model is then validated with various sparse object types, namely NIST handwritten letters [24] and Quickdraw objects [25], as well as with unseen diffusers and untrained detecting positions. Compared with conventional CNNs, the SACNN performs better: the SSIM, JI, and PCC increase by more than 10%, and the PSNR improves by more than 1 dB. Moreover, the SACNN shows a good balance between computational complexity and performance.

2. Method

2.1 Optical imaging system

The experimental setup shown in Fig. 1 includes a spatial light modulator (SLM) (Thorlabs EXULUS-HD2, pixel size 8 µm, 1920×1200 pixels). The central 800×800 pixels of the SLM are illuminated by a filtered and collimated CW laser beam at 632.8 nm. The phase-only SLM was calibrated using a Twyman-Green interferometer. The pattern uploaded to the SLM is a blazed grating with an 8.3° deviation angle overlaid (pixel by pixel) with a binary sparse object. Glass diffusers of four grit types (Thorlabs DG10-120-MD, DG10-220-MD, DG10-600-MD, DG10-1500-MD) were positioned sequentially at the conjugate plane of the SLM. To match the pixel size of the CMOS camera (Thorlabs DCC1645C, pixel size 3.6 µm, 1280×1024 pixels) with that of the SLM, we built a 4F system using two lenses, L1 (f = 300 mm) and L2 (f = 125 mm). For the training, testing, and validation dataset collection, the CMOS camera was placed at a distance of 50 mm from the focal plane of lens 3.
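As a rough, back-of-the-envelope check (our own estimate, not a calculation given in the paper), the 4F relay demagnifies the SLM pixels by f(L2)/f(L1), which brings them close to the camera pixel pitch:

```python
# Back-of-the-envelope check of the 4F pixel matching (our own estimate,
# not an author-provided calculation).
f1, f2 = 300e-3, 125e-3           # focal lengths of L1 and L2 in metres
slm_pixel = 8e-6                  # SLM pixel pitch
cmos_pixel = 3.6e-6               # CMOS pixel pitch

magnification = f2 / f1           # 4F lateral magnification, about 0.42
imaged_slm_pixel = slm_pixel * magnification
print(f"SLM pixel imaged onto camera: {imaged_slm_pixel*1e6:.2f} um "
      f"(camera pixel pitch: {cmos_pixel*1e6:.1f} um)")
# ~3.3 um per imaged SLM pixel, i.e. roughly one camera pixel per SLM pixel.
```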


Fig. 1. Experimental setup of the speckle correlation imaging system.


2.2 Data acquisition

To collect the training and testing datasets, we used 600 MNIST handwritten digits. We experimentally achieved an SBP of 800×800 pixels using up to 7200 training pairs. Our training and testing datasets were designed as shown in Fig. 2(a). This includes four cases, as follows:


Fig. 2. Overview of the training/testing and validation of the SACNN for sparse pattern reconstruction under various conditions. (a) The construction of the training/testing datasets, which are captured at 0 mm (T1), 20 mm (T2), and 40 mm (T3) away from the focal plane. (b) During the training/testing stage, the SACNN is trained with MNIST handwritten digits as labels and speckle patterns, i.e., T1, T2, and T3, as inputs. (c) The construction of the validation dataset with three different cases, i.e., seen diffusers, unseen MNIST handwritten digits, and untrained detection planes (V1); unseen diffusers and unseen MNIST handwritten digits (V2); and seen diffusers, unseen NIST handwritten letters, and Quickdraw objects (V3). (d) During the validation stage, the validation datasets are used to evaluate the reconstruction and generalization abilities of the SACNN.


Case 1: Train/test the network with speckles produced by a 120-grit diffuser at 0, 20, and 40 mm away from the image plane.

Case 2: Train/test the network with speckles produced by a 220-grit diffuser at 0, 20, and 40 mm away from the image plane.

Case 3: Train/test the network with speckles produced by a 600-grit diffuser at 0, 20, and 40 mm away from the image plane.

Case 4: Train/test the network with speckles produced by a 1500-grit diffuser at 0, 20, and 40 mm away from the image plane.

For the validation of the network, the following three cases were included, as shown in Fig. 2(c).

Case 1: Validate the network against speckle decorrelation owing to varied and extended imaging depths. It consists of 2400 validation pairs produced by the four seen diffusers with 100 unseen MNIST handwritten digits at 0, 10, 20, 30, 40, and 50 mm away from the image plane.

Case 2: Validate the network against speckle decorrelation owing to a change in diffusers. It consists of 400 validation pairs collected with unseen diffusers and 100 unseen MNIST handwritten digits.

Case 3: Validate the network against speckle decorrelation owing to a change in object type. It consists of 416 validation pairs of NIST handwritten letters and 420 validation pairs of Quickdraw objects.

The training/testing and validation data streams are shown in Figs. 2(b) and 2(d), respectively.

2.3 SACNN implementation

The encoder–decoder structure of the SACNN with single-layer SA and double-layer SA was investigated, and the network architecture is shown in Fig. 3. In the network, skip connections transfer information directly between layers of the same width. Dense blocks comprise multiple layers, each consisting of batch normalization (BN), rectified linear unit (ReLU) nonlinear activation, and convolution (Conv) with a growth rate of 16. The encoder uses four dense blocks with 6, 12, 24, and 16 layers; the decoder uses three dense blocks with 24, 12, and 6 layers. To avoid overtraining and overfitting, i.e., to decrease the effective complexity of the network, a dropout layer [26] in each dense block was used as a weight pruning method. The ReLU nonlinearity in each dense block improves the generalization ability of the network. Max pooling and average pooling reduce the dimensions of the input feature maps and further extract the core features. Nearest-neighbor interpolation was implemented for up-sampling and reconstruction of the sparse features of the MNIST handwritten digits. The network was trained and tested for 100 epochs using the root-mean-square (RMSProp) optimizer. Cross entropy and the negative Pearson correlation coefficient were chosen as the training and testing losses of the network, respectively. The output layer of the SACNN is a softmax layer that produces a two-channel, pixel-wise prediction based on grayscale reconstruction: predicted object and predicted background. For the single-layer SACNN, the SA block is inserted behind the last dense block of the encoder to strengthen the global speckle correlation at a low level. For the double-layer SACNN, another SA block is implemented immediately behind the first convolutional layer for high-level dependency extraction. The details of the single-layer and double-layer SACNNs are listed in Tables S1 and S2 in Supplement 1.
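For illustration (the exact hyperparameters are given in Tables S1 and S2 of Supplement 1; the dropout rate below is a placeholder of our own), a BN–ReLU–Conv dense block with a growth rate of 16 could be sketched in PyTorch as follows:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One BN -> ReLU -> Conv layer with growth rate 16, plus dropout pruning."""
    def __init__(self, in_channels, growth_rate=16, drop_rate=0.2):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)
        self.drop = nn.Dropout2d(drop_rate)

    def forward(self, x):
        out = self.drop(self.conv(self.relu(self.bn(x))))
        # Dense connectivity: concatenate the new feature maps onto the input.
        return torch.cat([x, out], dim=1)

class DenseBlock(nn.Module):
    """A dense block with num_layers layers (6, 12, 24, or 16 in the encoder)."""
    def __init__(self, in_channels, num_layers, growth_rate=16):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(num_layers):
            layers.append(DenseLayer(channels, growth_rate))
            channels += growth_rate
        self.block = nn.Sequential(*layers)
        self.out_channels = channels

    def forward(self, x):
        return self.block(x)
```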


Fig. 3. SACNN architecture for extracting statistical properties of speckle patterns. The SA layer(s) (marked in dark blue) is implemented for single-layer SACNN and double-layer SACNN. Starting with a high-resolution input speckle pattern, the encoder gradually condenses the lateral spatial information (size marked in black) into high-level feature maps with growing depths (size marked in purple); the decoder reverses the process by recombining the information into feature maps with gradually increased lateral details; the output consists of a two-channel object, and background pixel-wise prediction.


The data stream of the SA block can be expressed as follows:

$${S_{att}} = \mathrm{Softmax}\left( {\mathrm{maxpool}{{\left( {f({F_g},{\omega _f})} \right)}^T} \cdot \mathrm{maxpool}\left( {g({F_g},{\omega _g})} \right)} \right) \cdot \mathrm{maxpool}\left( {h({F_g},{\omega _h})} \right), $$
where ${F_g}$ is the input of the pipeline to which three convolutional layers are applied to transform ${F_g}$ into different feature spaces, and ${\omega _f},\; {\omega _g}$, and ${\omega _h}$ are the parameters of each convolutional layer.

The output of the SA block is obtained with an element-wise addition of ${S_{att}}$ and ${F_g}$ as follows:

$${F_{att}} = \gamma \cdot a({S_{att}},{\omega _a}) + {F_g}, $$
where ${\omega _a}$ is the parameter of the $1 \times 1$ convolutional layer, and $\gamma$ is a trainable parameter that adjusts the strength of the attention contribution added to ${F_g}$.
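As an illustrative sketch only (the pooling size and channel-reduction factor below are our assumptions, and the matrix-transposition conventions are approximate), Eqs. (1) and (2) could be implemented in PyTorch as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    """Sketch of the SA block of Eqs. (1)-(2): f, g, h are 1x1 convolutions,
    the attention map is built from max-pooled features, and a 1x1 conv `a`
    plus a trainable scalar gamma forms the residual output."""
    def __init__(self, channels, reduction=8, pool=2):
        super().__init__()
        inner = max(channels // reduction, 1)
        self.f = nn.Conv2d(channels, inner, kernel_size=1)     # omega_f
        self.g = nn.Conv2d(channels, inner, kernel_size=1)     # omega_g
        self.h = nn.Conv2d(channels, channels, kernel_size=1)  # omega_h
        self.a = nn.Conv2d(channels, channels, kernel_size=1)  # omega_a
        self.pool_size = pool
        self.pool = nn.MaxPool2d(pool)
        self.gamma = nn.Parameter(torch.zeros(1))              # trainable gamma

    def forward(self, Fg):
        b, c, H, W = Fg.shape
        fq = self.pool(self.f(Fg)).flatten(2)                  # (b, c', n)
        gk = self.pool(self.g(Fg)).flatten(2)                  # (b, c', n)
        hv = self.pool(self.h(Fg)).flatten(2)                  # (b, c,  n)
        attn = torch.softmax(fq.transpose(1, 2) @ gk, dim=-1)  # (b, n, n)
        S_att = hv @ attn.transpose(1, 2)                      # Eq. (1), (b, c, n)
        h2, w2 = H // self.pool_size, W // self.pool_size
        S_att = F.interpolate(S_att.view(b, c, h2, w2), size=(H, W), mode="nearest")
        return self.gamma * self.a(S_att) + Fg                 # Eq. (2)
```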

2.4 Data processing

The speckle patterns are first normalized to between 0 and 1 by dividing each image by 255. The sparse labels are binary. To reduce the number of network parameters and the demand for GPU memory and training data, the input speckle patterns were first downsampled from 800 × 800 pixels to 256 × 256 pixels using bilinear interpolation. To better reconstruct the sharp edges of the handwritten digits, we considered three upsampling methods, namely deconvolution, bilinear interpolation, and nearest-neighbor interpolation. Bilinear interpolation is a resampling technique widely used in computer vision and image processing; its interpolation function has two variables (x and y) on a rectilinear 2D grid. Nearest-neighbor interpolation approximates a non-given point by the value of its nearest neighboring point. Compared with bilinear interpolation, the images reconstructed using deconvolution are more obscure. Even though bilinear interpolation yields a high signal-to-noise ratio in the reconstructed images, the edge contrast is less sharp than with nearest-neighbor interpolation. A comparison of the three upsampling methods and their corresponding predictions is presented in the Supplementary notes and the related figure in Supplement 1. Here, we choose nearest-neighbor interpolation as the upsampling method. As mentioned in [15], a deep convolutional neural network can distill statistical information and filter out the interpolation-induced noise. The learning rate was $10^{-5}$ for the first 35 epochs, $10^{-6}$ for the subsequent 35 epochs, and $10^{-7}$ for the last 30 epochs. Once the network is trained, each prediction is performed in real time.
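A condensed sketch of this preprocessing and training schedule, written by us for illustration rather than taken from the authors' code, might look as follows:

```python
import numpy as np
import torch
import torch.nn.functional as F

def preprocess_speckle(raw_uint8):
    """Normalize an 800x800 8-bit speckle image to [0, 1] and downsample
    to 256x256 with bilinear interpolation, as described in Sec. 2.4."""
    img = torch.from_numpy(raw_uint8.astype(np.float32) / 255.0)
    img = img[None, None]                         # add batch and channel dims
    return F.interpolate(img, size=(256, 256), mode="bilinear",
                         align_corners=False)

# Nearest-neighbor upsampling used in the decoder (keeps sharp digit edges).
upsample = torch.nn.Upsample(scale_factor=2, mode="nearest")

def learning_rate(epoch):
    """Stepped schedule: 1e-5 (epochs 0-34), 1e-6 (35-69), 1e-7 (70-99)."""
    if epoch < 35:
        return 1e-5
    elif epoch < 70:
        return 1e-6
    return 1e-7
```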

3. Results

To quantitatively analyze the testing performance of the SACNN, the PCC, JI, SSIM, and PSNR for the testing dataset were calculated, as shown in Table 1. The PCC is essentially a normalized measurement of covariance, with a value of 1 representing perfect correlation. The SSIM evaluates the similarity between the reconstructed patterns and the corresponding ground truth; it is a decimal value between 0 and 1, where 1 represents perfect structural similarity and 0 indicates no structural similarity. The JI gauges the similarity and diversity between the prediction and its ground truth and is widely used in computer science. The PSNR quantifies the quality of the reconstructed sparse pattern; it is a logarithmic quantity on the decibel scale, and the higher the PSNR, the better the reconstructed image.
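For reference, the four indicators can be computed with standard Python libraries. The sketch below is ours; it assumes images normalized to [0, 1] and binarization at 0.5 for the Jaccard index, details the paper does not specify:

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def pcc(pred, truth):
    """Pearson correlation coefficient between prediction and ground truth."""
    return np.corrcoef(pred.ravel(), truth.ravel())[0, 1]

def jaccard(pred, truth, threshold=0.5):
    """Jaccard index on binarized images: |intersection| / |union|."""
    p, t = pred > threshold, truth > threshold
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union else 1.0

def evaluate(pred, truth):
    """Return the four indicators for one reconstructed pattern in [0, 1]."""
    return {
        "PCC": pcc(pred, truth),
        "SSIM": structural_similarity(pred, truth, data_range=1.0),
        "JI": jaccard(pred, truth),
        "PSNR": peak_signal_noise_ratio(truth, pred, data_range=1.0),
    }
```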


Table 1. Comparison of the testing performance of the trained SACNN and CNN

The testing performance of the trained CNN, single-layer SACNN, and double-layer SACNN is listed in Table 1. The testing results of the SACNN were only slightly better than those of the CNN; in terms of generalization ability, however, the SACNN shows far better reconstruction performance.

For validation case 1, we validated the networks at trained and untrained detection positions using unseen MNIST handwritten digits. The performances of the CNN and the single-layer and double-layer SACNNs are shown in Fig. 4. It is worth noting that, compared with the CNN, the minimum values of PCC, JI, and SSIM for the SACNN are improved by more than 11%, 10%, and 11%, respectively. The maximum and minimum values of PSNR are increased by approximately 1 dB and 1.5 dB, respectively. Moreover, the SACNN extends the detection range of the system to a detection position 50 mm away from the focal plane.


Fig. 4. Performance of the trained SACNN (single-layer and double-layer SA) and CNN for validation case 1. The average PCC (a), JI (b), and SSIM (c) for the unseen digits with the detecting position from 0 mm to 50 mm away from the focal plane.


An intuitive visualization of the JI score is shown in Fig. 5. The SACNN shows good reconstruction performance for untrained detection distances, i.e., 10 mm, 30 mm, and 50 mm. Moreover, the 50 mm detecting position is even outside the training/testing range.


Fig. 5. The ground truth and prediction of the trained SACNN (single-layer and double-layer SA) and CNN with the camera placed 10 mm, 30 mm, and 50 mm away from the focal plane. The prediction results are overlaid with the true positive (white), false positive (green), and false negative (purple).


The indicators for validation cases 2 and 3 are presented in Fig. 6. For validation case 2, the images of unseen MNIST handwritten digits formed by the unseen diffuser set can be reconstructed. For validation case 3, the predictions for unseen NIST handwritten letters and Quickdraw objects with the seen diffuser set were evaluated. It can be seen that the values of PCC, JI, and SSIM obtained with the SACNN increased by more than 14%, 14%, and 10%, respectively, and the PSNR increased by up to 2.5 dB.


Fig. 6. Quantitative analysis of the trained SACNN (single-layer and double-layer SA) and CNN for MNIST handwritten digits, NIST handwritten letters, and Quickdraw objects. Each bar represents the mean value of PCC, JI, and SSIM for each type of sparse pattern. Each error bar is the standard deviation of PCC, JI, and SSIM.


Figure 7 shows the intuitive visualization of the JI score for the single-layer SA (first and second columns of each category) and double-layer SA (third and fourth columns of each category) networks. The ground truth is labeled for the three categories, i.e., MNIST handwritten digits, NIST handwritten letters, and Quickdraw objects. The related predicted result is further broken down into the true positive (white), false positive (green), and false negative (purple). Among them, the predictions for MNIST handwritten digits are reconstructed under the condition of an unseen diffuser set with the same macroscopic properties as the training/testing diffuser set.


Fig. 7. The ground truth and related prediction for the unseen MNIST handwritten digits, NIST handwritten letters, and Quickdraw objects of the trained SACNN (single-layer and double-layer SA) and CNN. To better visualize the result, the prediction results are overlaid with the true positive (white), false positive (green), and false negative (purple).


The cross-correlation coefficients for the 120-grit diffuser are shown in Fig. 8. The ME range of the system can be expressed as 2×p×δp/M, where p is the pixel size of the CMOS camera, δp is the offset pixel number on the image plane, and M is the magnification of the system. We determined the ME range for the 120-grit diffuser by using a cross-correlation coefficient of 0.5 as the threshold. After this calculation, the sparse objects we displayed on the SLM were found to extend 75 times beyond the ME range of the 120-grit diffuser, which corresponds to 6×6 pixels.
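The sketch below illustrates this procedure with placeholder numbers (the shift values, decorrelation curve, and assumed magnification are ours, not measured data): the cross-correlation coefficient is thresholded at 0.5 to obtain the offset pixel number δp, which then gives the ME range through 2×p×δp/M.

```python
import numpy as np

def memory_effect_range(shifts_px, cross_corr, threshold=0.5):
    """Return the largest lateral shift (in camera pixels) whose
    cross-correlation coefficient stays above the threshold."""
    shifts_px = np.asarray(shifts_px)
    cross_corr = np.asarray(cross_corr)
    within = shifts_px[cross_corr >= threshold]
    return within.max() if within.size else 0

# Hypothetical measured curve, for illustration only (not the paper's data).
shifts = np.arange(0, 20)                    # lateral shift in camera pixels
corr = np.exp(-shifts / 4.0)                 # toy decorrelation curve
delta_p = memory_effect_range(shifts, corr)  # offset pixel number delta_p

p = 3.6e-6    # CMOS pixel size (m)
M = 125/300   # system magnification (assumed equal to f2/f1 of the 4F relay)
me_range = 2 * p * delta_p / M               # ME range, 2 * p * delta_p / M
print(f"delta_p = {delta_p} px, ME range = {me_range*1e6:.1f} um")
```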


Fig. 8. Cross-correlation coefficient of the 120-grit diffuser used in the speckle correlation imaging system.


4. Conclusion

We have proposed a high-generalization “one-to-all” sparse object reconstruction method, the SACNN, for reconstructing various sparse features. Compared with the CNN, the SACNN shows high generalization performance for various unseen sparse pattern types, untrained diffuser sets, and detecting positions, with significantly improved values of the scientific indicators. Moreover, the network breaks the limitation of speckle decorrelation to a certain extent. For strong scattering, the performance of the SACNN can be further improved by using denoising or edge enhancement algorithms. We believe that the SACNN can be further applied to complex tissue imaging to boost contrast and resolution, especially in wide-field and large-depth-of-range angiographic imaging.

Funding

China Postdoctoral Science Foundation (2020M671169); Zhangjiang National Innovation Demonstration Zone (ZJ2019-ZD-005).

Acknowledgments

Min Gu acknowledges the funding support from the Zhangjiang National Innovation Demonstration Zone (ZJ2019-ZD-005). Yangyundou Wang is supported by a fellowship from the China Postdoctoral Science Foundation (2020M671169). We thank Steffen Schoenhardt for helpful discussions on the manuscript.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

The datasets presented in this paper are available from the corresponding author upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. J. N. Mait, G. W. Euliss, and R. A. Athale, “Computational imaging,” Adv. Opt. Photonics 10(2), 409–483 (2018).

2. G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica 6(8), 921–943 (2019).

3. L. V. Wang and H. Wu, Biomedical Optics: Principles and Imaging (John Wiley & Sons, 2009).

4. M. Gu, X. Gan, and X. Deng, Microscopic Imaging through Turbid Media (Springer, 2015).

5. P. Pai, J. Bosch, M. Kühmayer, S. Rotter, and A. P. Mosk, “Scattering invariant modes of light in complex media,” arXiv:2010.01075 (2020).

6. I. M. Vellekoop and A. P. Mosk, “Focusing coherent light through opaque strongly scattering media,” Opt. Lett. 32(16), 2309–2311 (2007).

7. X. Wei, Y. Shen, J. C. Jing, A. S. Hemphill, C. Yang, S. Xu, Z. Yang, and L. V. Wang, “Real-time frequency-encoded spatiotemporal focusing through scattering media using a programmable 2D ultrafine optical frequency comb,” Sci. Adv. 6(8), eaay1192 (2020).

8. G. Huang, D. Wu, J. Luo, Y. Huang, and Y. Shen, “Retrieving the optical transmission matrix of a multimode fiber using the extended Kalman filter,” Opt. Express 28(7), 9487–9500 (2020).

9. J. Bertolotti, E. G. van Putten, C. Blum, A. Lagendijk, W. L. Vos, and A. P. Mosk, “Non-invasive imaging through opaque scattering layers,” Nature 491(7423), 232–234 (2012).

10. G. Osnabrugge, R. Horstmeyer, I. N. Papadopoulos, B. Judkewitz, and I. M. Vellekoop, “Generalized optical memory effect,” Optica 4(8), 886–892 (2017).

11. L. Li, Q. Li, S. Sun, H. Z. Lin, W. T. Liu, and P. X. Chen, “Imaging through scattering layers exceeding memory effect range with spatial-correlation-achieved point-spread-function,” Opt. Lett. 43(8), 1670–1673 (2018).

12. C. Guo, J. Liu, W. Li, T. Wu, L. Zhu, J. Wang, G. Wang, and X. Shao, “Imaging through scattering layers exceeding memory effect range by exploiting prior information,” Opt. Commun. 434, 203–208 (2019).

13. X. Wang, X. Jin, J. Li, X. Lian, X. Ji, and Q. Dai, “Prior-information-free single-shot scattering imaging beyond the memory effect,” Opt. Lett. 44(6), 1423–1426 (2019).

14. S. Li, M. Deng, J. Lee, A. Sinha, and G. Barbastathis, “Imaging through glass diffusers using densely connected convolutional networks,” Optica 5(7), 803–813 (2018).

15. Y. Li, Y. Xue, and L. Tian, “Deep speckle correlation: a deep learning approach toward scalable imaging through scattering media,” Optica 5(10), 1181–1190 (2018).

16. M. Lyu, H. Wang, G. Li, S. Zheng, and G. Situ, “Learning-based lensless imaging through optically thick scattering media,” Adv. Photonics 1(03), 1 (2019).

17. Y. Sun, J. Shi, L. Sun, J. Fan, and G. Zeng, “Image reconstruction through dynamic scattering media based on deep learning,” Opt. Express 27(11), 16032–16046 (2019).

18. S. Zheng, H. Wang, S. Dong, F. Wang, and G. Situ, “Incoherent imaging through highly nonstatic and optically thick turbid media based on neural network,” Photonics Res. 9(5), B220–B228 (2021).

19. E. Guo, S. Zhu, Y. Sun, L. Bai, C. Zuo, and J. Han, “Learning-based method to reconstruct complex targets through scattering medium beyond the memory effect,” Opt. Express 28(2), 2433–2446 (2020).

20. X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), pp. 7794–7803.

21. H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” arXiv:1805.08318 (2018).

22. C. Hu and Y. Wang, “An efficient convolutional neural network model based on object-level attention mechanism for casting defect detection on radiography images,” IEEE Trans. Ind. Electron. 67(12), 10922–10930 (2020).

23. Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/.

24. National Institute of Standards and Technology, “NIST Special Database 19,” https://www.nist.gov/srd/nist-special-database-19.

25. Google, “The Quick Draw Dataset,” https://quickdraw.withgoogle.com/data.

26. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15(1), 1929–1958 (2014).
