
SiSPRNet: end-to-end learning for single-shot phase retrieval

Open Access

Abstract

With the success of deep learning methods in many image processing tasks, deep learning approaches have also been introduced to the phase retrieval problem recently. These approaches differ from traditional iterative optimization methods in that they usually require only one intensity measurement and can reconstruct phase images in real time. However, because of the tremendous discrepancy between the Fourier and image domains, the quality of the images reconstructed by these approaches still leaves much room for improvement before meeting general application requirements. In this paper, we design a novel deep neural network structure named SiSPRNet for phase retrieval based on a single Fourier intensity measurement. To effectively utilize the spectral information of the measurements, we propose a new feature extraction unit using a Multi-Layer Perceptron (MLP) as the front end. It allows all pixels of the input intensity image to be considered together when exploring their global representation. The size of the MLP is carefully chosen to facilitate the extraction of representative features while suppressing noise and outliers. A dropout layer is also included to mitigate possible overfitting when training the MLP. To promote global correlation in the reconstructed images, a self-attention mechanism is introduced to the Up-sampling and Reconstruction (UR) blocks of the proposed SiSPRNet. These UR blocks are embedded in a residual learning structure to prevent the weak information flow and vanishing gradient problems caused by their complex layer structure. Extensive evaluations of the proposed model are performed using different testing datasets of phase-only images and images with linearly related magnitude and phase. Experiments were conducted on an optical experimentation platform (with defocusing to reduce the saturation problem) to understand the performance of different deep learning methods when working in a practical environment. The results demonstrate that the proposed approach consistently outperforms other deep learning methods in single-shot maskless phase retrieval. The source code of the proposed method has been released on GitHub [1].

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Phase retrieval aims to reconstruct a complex-valued signal from its intensity-only measurements. It is a crucial problem in crystallography, optical imaging, astronomy, X-ray imaging, electron imaging, etc., because most existing measurement systems can only detect the relatively low-frequency magnitude information of the wave fields. The study of phase retrieval problems originated in the 1970s, and researchers in the optics community have developed many reconstruction algorithms [2–4].

Mathematically, the phase retrieval problem can be described by the following equation:

$$\text{ Find } \mathbf{x} \in \mathbb{C}^{N} \quad \text{ s.t. } \ \mathbf{Y}_{i}=\left|\mathcal{F}\left(\mathbf{I}_{i} \circ \mathbf{x}\right)\right|^{2}, i=1, \ldots, M,$$
where $\mathbf {x}\in \mathbb {C}^N$ is the complex-valued signal of interest, and $\mathbf {Y}_i$ is its $i$-th Fourier intensity measurement; $\circ$ and $\mathcal {F}$ refer to the elementwise multiplication and Fourier transform operators, respectively. The pre-defined optical masks $\mathbf {I}_i$ provide constraints that reduce the ill-posedness of the problem. They can be implemented in many different ways. For instance, early-stage phase retrieval algorithms treat the support of the signal as the optical mask [3]. Specifically, $|\mathbf {x}_i| \neq 0$ when $i \in \mathcal {S}$, where $\mathcal {S}$ denotes the support of the signal. However, these early-stage algorithms can neither guarantee the convergence of the optimization process nor provide globally optimal solutions. Recently, pre-defined random optical masks have been used as constraints to improve the reconstruction performance [5–8]. The random masks can be implemented using a spatial light modulator (SLM) or digital micromirror device (DMD) [6,9]. Although the use of random masks can lead to better reconstruction performance, there are several disadvantages. The cost of the DMD and SLM is one concern; the errors due to the global gray-scale phase mismatch and spatial non-uniformity of these devices are another. Besides, it has been shown empirically that about $4$-$6$ measurements are required for exact recovery with random binary masks [7,10]. This increases the measurement time and can affect the quality of the reconstructed image, particularly for dynamic objects.

Recently, data-driven methods, like deep neural networks, have been widely applied to explore data distributions through extensive training samples [11]. Among them, the Convolutional Neural Network (CNN), which extracts features with convolution kernels, has been successfully adopted in image processing tasks such as image restoration [12–14]. Besides, the attention mechanism, which mimics cognitive attention, has become a popular addition to CNN structures to enhance parts of the feature maps and thus improve the performance [15]. Since CNNs can learn discriminative representations from large training datasets, researchers have put effort into using them for solving phase retrieval problems [16–19]. For instance, Chen et al. reconstructed signals from Fresnel diffraction patterns with a CNN and the contrast transfer function [20]. Kumar et al. proposed deep unrolling networks that solve the iterative phase retrieval problem with CNNs [21]. Uelwer et al. proposed a cascaded neural network for phase retrieval [22]. Besides, the conditional generative adversarial network (GAN), a data-driven generative method conditioned on specific inputs [23], was adopted for reconstructing the phase information from Fourier intensity measurements [24]. Wu et al. also proposed a CNN to reconstruct crystal structures from coherent X-ray diffraction patterns [25,26]. The above methods adopt supervised learning that trains the networks with paired ground-truth data. In recent years, training with unpaired or unsupervised (no ground-truth data) datasets has become a popular topic. For example, Fus et al. proposed unsupervised learning for in-line holography phase retrieval [27], while Zhang et al. developed a GAN-based phase retrieval method with unpaired datasets [28]. While the above approaches have achieved some success, phase retrieval with Fraunhofer diffraction patterns (i.e., Fourier intensity) for general applications is still a challenge. This is partly because of the large domain discrepancy between the Fourier plane and the image plane; the large variation in the data distributions of general applications also makes it difficult to design a model structure that is optimal for different applications.

In this paper, we propose a single-shot maskless phase retrieval method named SiSPRNet that benefits from advanced deep learning technology. Similar to existing deep learning approaches, the proposed SiSPRNet only requires a single intensity measurement for each reconstruction. It also does not require extra optical masks to impose constraints on the intensity measurements. The main novelties of SiSPRNet are twofold. First, it contains a new feature extraction unit using the Multi-Layer Perceptron (MLP) as the front end. Traditional deep learning phase retrieval approaches often use a CNN as the front end to extract features from the Fourier intensity measurement. However, the convolution operation of a CNN can only explore a small area of the intensity image at a time. This does not match the property of Fourier intensity images, whose data are globally correlated. The proposed feature extraction unit starts with an MLP block that consists of three Fully Connected (FC) layers. They allow all pixels of the input intensity image to be considered together for exploring their global correlation.
These FC layers have a size smaller than the input image, which guides the backpropagation process to train the MLP to generate useful features of the reconstructed image while ignoring less important information, such as noise and outliers. Such a design also reduces the complexity of the network. The proposed feature extraction unit is equipped with a Dropout layer between the FC layers to mitigate the overfitting problem, which is common to MLPs with many learnable parameters. Another novelty of SiSPRNet is that it is equipped with a new self-attention-based phase reconstruction unit to generate the required images from the extracted features. Traditional deep learning phase retrieval methods reconstruct the magnitude and phase images from the extracted features without considering the global correlation of the underlying objects. However, most phase retrieval applications aim to recover the structure of physical objects, which are often structured and globally correlated. The proposed phase reconstruction unit is equipped with two UR blocks, in which two self-attention units are introduced to explore the global correlation of the feature maps. They are inserted into a residual learning structure to prevent the weak information flow and vanishing gradient problems caused by the complex layer structure of the UR block. Compared with the traditional convolution operation that only considers local features, the self-attention plus residual learning mechanism helps improve the phase retrieval performance.

The proposed SiSPRNet is designed to work with general image datasets without assuming a specific application domain. We verified the proposed method on an optical platform to show its practicality. It has demonstrated its generality by achieving state-of-the-art performance on three different datasets of phase-only images and images with linearly related magnitude and phase, significantly outperforming the existing deep learning Fourier phase retrieval methods. The source code of the proposed method has been released on GitHub [1].

The rest of this paper is organized as follows. Section 2 introduces the defocused phase retrieval system we developed in this study and the proposed end-to-end deep learning phase retrieval network SiSPRNet. Section 3 presents the model analysis and simulation results. Section 4 provides the experimental results on an optical system for validating the performance of the proposed method.

2. Proposed method

2.1 Phase retrieval system

As indicated in Eq. (1), a phase retrieval system aims to reconstruct a complex-valued signal $\mathbf {x}$ from its Fourier intensity measurements. The proposed method does not require multiple measurements or optical masks, so $M=1$ and all values of the mask $\mathbf {I}$ are equal to one in Eq. (1). In this case, the original complex-valued signal $\mathbf {x}$ is reconstructed only from its Fourier intensity. This seemingly impossible task is known to have a solution, although with ambiguities. From [29], it is known that $\mathbf {x}$ can be uniquely defined, except for trivial ambiguities, by its Fourier intensity with an oversampling factor over $2$ in each dimension, if $\mathbf {x}$ has a finite support and is non-symmetric. This has an important implication for the optical system required by the proposed phase retrieval method. Figure 1 shows the system we constructed to implement the proposed method. In the system, the sampling of the CMOS camera is chosen such that the captured Fourier intensity image is sampled at least two times more densely in each dimension than the object image $\mathbf {x}$. As a result, the number of samples in the zero diffraction order of the Fourier intensity image is $762\times 762$, while the number of samples of $\mathbf {x}$ is $128\times 128$. Compared with other deep learning-based phase retrieval approaches that often accept very small intensity measurements (such as $64\times 64$ or even $28\times 28$ pixels), we use normal-sized intensity measurements ($762\times 762$ pixels) to allow a sufficiently large oversampling ratio and object images of relatively large size ($128\times 128$ pixels). This enables more applications for the proposed SiSPRNet. From the $762\times 762$ pixel intensity measurement, we extract the central $128\times 128$ pixels and feed them to the proposed SiSPRNet model. This choice balances complexity and performance, as shown in the ablation analysis in Section 3.2.1.


Fig. 1. Optical path of the defocus phase retrieval system. $\mathbf {x}$ represents the complex-valued object, and $\mathbf {Y}$ is its Fourier intensity measurement; $\mathbf {X} = \mathcal {F}\left (\mathbf {I}_{i} \circ \mathbf {x}\right )$, $\mathbf {H}$ and $\ast$ denote the Fourier plane in complex-valued form, the defocus function generated via Fresnel diffraction and the convolution operation, respectively.


When capturing Fourier intensity images, saturation is a common problem because most structured signals have their energy concentrated in the low-frequency areas, in particular at the zero frequency. Due to the limited analog-to-digital resolution of standard scientific cameras ($14$-$16$ bits), the central regions of the Fourier intensity measurement are often severely saturated. To lessen the saturation problem, we can defocus the measurement by moving the camera beyond the Fourier plane, which reduces the dynamic range [30]. This operation can be mathematically modeled as the convolution of $\mathcal {F}(\mathbf {x})$ with a defocus function $\mathbf {H}$. In our simulations and experiments, we synthesized it by multiplying the object with the inverse Fourier transform of $\mathbf {H}$. More details can be found in Sections 3 and 4. Figure 1 shows a typical optical path for implementing the defocus phase retrieval system, and Fig. 11 shows some examples of the Fourier intensity measurements obtained in our experiments.

Besides the lens-based system implemented in this work, lensless phase retrieval systems can also be found [19]. Different from the lens-based system, which reconstructs images from their Fraunhofer diffraction patterns, the lensless system reconstructs images from their Fresnel diffraction patterns [19]. Compared with the lens-based system, the lensless system requires a much longer distance between the sensor and the object. For instance, in [19], the distance between the sensor and object is reported as $37.5$ cm or longer. By using a lens, the lens-based system can generate high-quality Fraunhofer diffraction patterns with a short propagation distance between the lens and sensor ($3$ cm or less).

2.2 Deep learning model: SiSPRNet

With the oversampling constraint mentioned above, it is possible to reconstruct a complex-valued signal (with trivial ambiguities) from only its Fourier intensity without other constraints (such as optical masks). However, such a reconstruction problem is highly non-convex and difficult to solve with traditional optimization methods. Owing to their non-linear nature, deep neural networks have been successfully applied to many non-convex optimization problems. In this paper, we regard the single-shot maskless phase retrieval challenge as an end-to-end learning task and formulate the retrieval process directly as a deep neural network model (denoted as $P(\mathbf {Y};\theta )$):

$$\hat{\mathbf{x}} = P(\mathbf{Y};\theta),$$
where $\theta$ represents the learnable parameters of the deep learning model. The training target is to minimize the error between the predicted $P(\mathbf {Y};\theta )$ and the ground-truth complex-valued image $\mathbf {x}$ regularized with the total variation (TV) norm $TV(\hat {\mathbf {x}})$. The training process can be written as:
$$P(\mathbf{Y};\theta) = \mathop{argmin}_{\hat{\mathbf{x}}} \left( \left\| \mathbf{x}-\hat{\mathbf{x}} \right\|_1 + \alpha TV\left(\hat{\mathbf{x}}\right)\right),$$
where $\left \| \cdot \right \|_1$ denotes the $\ell _1$-norm distance and $\alpha$ is a constant determined empirically.

Figure 2 shows the overall structure of the proposed phase retrieval network, SiSPRNet. It mainly consists of two parts: feature extraction and phase reconstruction. It takes a Fourier intensity measurement $\mathbf {Y}$ as input and reconstructs the complex-valued image $\hat {\mathbf {x}}$ as its output.


Fig. 2. Structure of the proposed network that mainly consists of two parts: Feature Extraction and Phase Reconstruction.


2.2.1 Feature extraction

Traditional learning-based approaches often use a CNN as the front end to extract the features of the Fourier intensity measurement. However, the convolution operation of a CNN focuses on the local correlation of the input signal, which does not match the property of Fourier intensity images, whose data are globally correlated. Therefore, we propose a new Multi-Layer Perceptron (MLP) block to fuse the information of different frequency data, as shown in Fig. 2(a). The MLP block consists of three Fully Connected (FC) layers and one Dropout layer. The input Fourier measurement is first reshaped from the size of $128\times 128\times 1$ to $1\times 1\times 16,384$ and then fed to the first FC layer of the MLP block. Each FC layer has $1,024$ neurons for extracting the features required for reconstructing the original complex-valued images. The number of neurons is smaller than the total number of pixels of the input image. This guides the backpropagation process to train the MLP to generate useful features of the reconstructed image while ignoring less important information, such as noise and outliers. Such a design also reduces the complexity of the network. More analysis can be found in the ablation study in Section 3.2.4. We also embed a Dropout layer after the first FC layer. The Dropout layer randomly deactivates a small fraction (empirically $20\%$) of the neurons at the training stage. At the inference stage, all neurons are active, which forms an ensemble that reduces the risk of overfitting [31]. Embedding a Dropout layer after the first FC layer constrains the behavior of the FC layers so that the features extracted by the MLP block do not depend on a particular input frequency measurement. The effectiveness of the Dropout layer is evaluated in Section 3.2.3. Figure 2 shows an example of the original complex-valued image (real and imaginary parts), the Fourier intensity measurement, and the features extracted by the MLP block. It can be seen that the MLP block has performed the domain conversion: the extracted feature image contains the features of the original complex-valued image. It is then sent to the phase reconstruction part to reconstruct the real and imaginary images.
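For concreteness, the following is a minimal PyTorch sketch of the MLP front end described above: three $1,024$-neuron FC layers with a single Dropout layer (rate $0.2$) after the first one. The activations between the FC layers and the reshape of the $1,024$-dimensional output to a $32\times 32$ feature map (cf. Section 3.2.4) are our assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FeatureExtractionMLP(nn.Module):
    """Sketch of the MLP feature-extraction block (Sec. 2.2.1): three FC
    layers of 1,024 neurons each, with one Dropout layer (p = 0.2) after
    the first FC layer.  The flattened 128x128 intensity patch is mapped
    to a 1,024-dim feature vector and reshaped to a 32x32 map for the
    phase reconstruction part."""

    def __init__(self, in_pixels=128 * 128, width=1024, p_drop=0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_pixels, width),
            nn.Dropout(p_drop),      # the single Dropout layer (Sec. 3.2.3)
            nn.PReLU(),              # activation choice is an assumption
            nn.Linear(width, width),
            nn.PReLU(),
            nn.Linear(width, width),
        )

    def forward(self, y):
        # y: (B, 1, 128, 128) central crop of the Fourier intensity
        f = self.mlp(y.flatten(start_dim=1))   # (B, 1024)
        return f.view(-1, 1, 32, 32)           # 2-D feature map for the UR blocks
```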

2.2.2 Phase reconstruction

The phase reconstruction part contains two Up-sampling and Reconstruction (UR) blocks and one Post-processing block, as shown in Fig. 2(b) and (c). The UR block is responsible for reconstructing the complex-valued image from the extracted features and progressively upsampling it back to the original size. The basic processing unit of the UR block is the convolution block that has three components: convolutional, instance normalization, and Parametric Rectified Linear Unit (PReLU) layers. The convolutional layer contains $3\times 3$ learnable filters that aim to implicitly approximate the nonlinear backward mapping from the extracted features to the desired complex-valued image $\mathbf {x}$. The instance normalization and PReLU layers are non-linear functions that increase the non-linear capacity and learning ability of the deep learning model.

A self-attention unit follows each convolution block. Recently, the self-attention technique has been widely used in image reconstruction tasks, e.g., for the scattering problem [32,33]. Traditional CNNs using small convolution kernels can only extract the local correlation of an image. However, natural images are often structured and exhibit global correlation. Self-attention is an effective way to exploit the global correlation in images. The structure of the self-attention layer used in the UR block is similar to that in [32], as shown in Fig. 3. At the first stage, the input features are fed to three $1\times 1$ convolutional layers ($\theta, \phi$ and $g$). The outputs are denoted as the key, query, and value, respectively, all with the same dimensions. Next, the cross-correlation matrix is computed between the key and query. It represents the dependency between any two pixels of the input features. The softmax function is then applied to normalize the correlation scores and build an attention map. Large values in the attention map denote strong correlations between the features concerned. Since the correlation is computed among all features, attention is paid not only to those with local correlations but also to the global ones. The result is finally multiplied with the value to generate the self-attention feature maps. They are then added back to the original features to strengthen the features of interest. Such an attention mechanism is important to the training of the UR block: it allows the UR block to reconstruct images with globally correlated features. For instance, for the extracted features shown in Fig. 2, the UR block will try to reconstruct both the left and right eyes together, since they are highly correlated in the training face images of the dataset. Since the correlation of the features may change at different resolutions, another UR block is used after the first one to process the features at a higher resolution.
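Below is a minimal PyTorch sketch of this self-attention unit. It follows the non-local formulation described in the text ($1\times 1$ convolutions producing the key, query, and value with the same dimensions, a softmax-normalized cross-correlation as the attention map, and an additive skip back to the input); the exact configuration of the layer in [32] may differ.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Sketch of the self-attention unit of the UR block (Fig. 3)."""

    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, kernel_size=1)  # key
        self.phi = nn.Conv2d(channels, channels, kernel_size=1)    # query
        self.g = nn.Conv2d(channels, channels, kernel_size=1)      # value

    def forward(self, x):
        b, c, h, w = x.shape
        key = self.theta(x).flatten(2)       # (B, C, HW)
        query = self.phi(x).flatten(2)       # (B, C, HW)
        value = self.g(x).flatten(2)         # (B, C, HW)
        # cross-correlation between every pair of positions -> attention map
        attn = torch.softmax(key.transpose(1, 2) @ query, dim=-1)  # (B, HW, HW)
        out = (value @ attn).view(b, c, h, w)  # attended (globally mixed) features
        return x + out                         # add back to the original features
```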


Fig. 3. Structure of the attention mechanism.


Directly stacking convolutional and attention layers would lead to an obvious performance drop because of weak information flow and gradient vanishing [12]. Therefore, we add a shortcut connection around the convolution blocks and the attention mechanism to form a residual structure (shown in Fig. 2(b)), which is widely adopted in image processing tasks [34–36]. The residual setting avoids the gradient vanishing problem and therefore reduces the learning difficulty. As shown in Fig. 2, the self-attention layers effectively estimate the residue between the input features and the desired ones. Finally, we utilize a pixel shuffle layer to reshape the feature maps and enlarge their spatial size. The effectiveness of the residual self-attention structure is further discussed in Section 3.2.6.
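The following sketch assembles one UR block as described: two Conv Blocks ($3\times 3$ convolution, instance normalization, PReLU), each followed by a self-attention unit (reusing the SelfAttention sketch above), a residual shortcut with a PReLU, and a pixel-shuffle up-sampling step. The channel width and the extra $3\times 3$ convolution before the pixel shuffle are assumptions.

```python
import torch.nn as nn

def conv_block(channels):
    """Conv Block of the UR block: 3x3 convolution, instance norm, PReLU."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.InstanceNorm2d(channels),
        nn.PReLU(),
    )

class URBlock(nn.Module):
    """Sketch of one Up-sampling and Reconstruction (UR) block."""

    def __init__(self, channels, upscale=2):
        super().__init__()
        self.body = nn.Sequential(             # residual unit (Sec. 3.2.6)
            conv_block(channels), SelfAttention(channels),
            conv_block(channels), SelfAttention(channels),
        )
        self.act = nn.PReLU()
        self.up = nn.Sequential(               # trade channels for resolution
            nn.Conv2d(channels, channels * upscale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(upscale),
        )

    def forward(self, x):
        x = self.act(x + self.body(x))         # shortcut around conv + attention
        return self.up(x)                      # spatially up-sampled features
```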

2.2.3 Post-processing block

The Post-processing block contains two convolutional layers (shown in Fig. 2(c)). It estimates the real $\hat {\mathbf {x}}_{Re}$ and imaginary $\hat {\mathbf {x}}_{Im}$ parts of the image from the up-sampled features.
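Putting the pieces together, a rough end-to-end assembly of the described pipeline might look as follows, reusing the FeatureExtractionMLP and URBlock sketches above. Only the overall structure (MLP features, two UR blocks growing $32\to 64\to 128$, and a Post-processing block of two convolutional layers outputting the real and imaginary parts) follows the paper; the intermediate channel width, the head convolution, and the activation inside the Post-processing block are our assumptions.

```python
import torch.nn as nn

class SiSPRNetSketch(nn.Module):
    """Rough assembly of the described SiSPRNet pipeline (Fig. 2)."""

    def __init__(self, channels=64):
        super().__init__()
        self.extract = FeatureExtractionMLP()          # Sec. 2.2.1
        self.head = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.ur1 = URBlock(channels)                   # 32x32 -> 64x64
        self.ur2 = URBlock(channels)                   # 64x64 -> 128x128
        self.post = nn.Sequential(                     # Post-processing block
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),  # [real, imag]
        )

    def forward(self, y):
        f = self.head(self.extract(y))                 # (B, C, 32, 32)
        f = self.ur2(self.ur1(f))                      # (B, C, 128, 128)
        out = self.post(f)                             # (B, 2, 128, 128)
        return out[:, 0:1], out[:, 1:2]                # real, imaginary estimates
```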

2.2.4 Loss function

We regard the phase retrieval problem as a supervised learning task. At the training stage, we prepare a set of paired training data. Each Fourier intensity measurement has a paired complex-valued image as the ground-truth target. Then, the deep learning model can be optimized by minimizing the $\ell _1$-norm distance between the estimation and the ground-truth images plus the TV-norm to remove the noise while keeping the image edges. The loss function $\mathcal {L}$ is defined as:

$$\mathcal{L} = \left\| \mathbf{x}_{Re}-\hat{\mathbf{x}}_{Re} \right\|_1 + \left\| \mathbf{x}_{Im}-\hat{\mathbf{x}}_{Im} \right\|_1 + \alpha TV\left(\hat{\mathbf{x}}\right).$$
The symbols $\mathbf {x}_{Re}$ and $\mathbf {x}_{Im}$ represent the ground-truth real and imaginary parts of the image, respectively.
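A possible implementation of Eq. (4) is sketched below. The anisotropic form of the TV norm, its application to both the real and imaginary estimates, and the literal sum reductions for the $\ell_1$ terms are assumptions.

```python
import torch
import torch.nn.functional as F

def tv_norm(x):
    """Anisotropic total-variation norm of a batch of images (B, 1, H, W)."""
    dh = torch.abs(x[..., 1:, :] - x[..., :-1, :]).sum()
    dw = torch.abs(x[..., :, 1:] - x[..., :, :-1]).sum()
    return dh + dw

def sisprnet_loss(x_re, x_im, xhat_re, xhat_im, alpha=1.0):
    """Eq. (4): l1 distance on the real and imaginary parts plus a TV
    regularizer on the estimate (alpha = 1 as stated in Sec. 3.1)."""
    l1 = F.l1_loss(xhat_re, x_re, reduction='sum') \
       + F.l1_loss(xhat_im, x_im, reduction='sum')
    tv = tv_norm(xhat_re) + tv_norm(xhat_im)
    return l1 + alpha * tv
```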

3. Simulation results

3.1 Simulation setup

We conducted the simulations with images from two different datasets: the Real-world Affective Faces (RAF) dataset [37] and the Fashion-MNIST dataset [38]. Specifically, the training data were converted to gray-scale images and resized to $128 \times 128$. Then the pre-processed images $\mathbf {x} \in [0, 1]$ were used to generate complex-valued images of two forms: phase-only and magnitude-phase. For the phase-only images, we mapped $\mathbf {x}$ to the $2\pi$ phase domain via an exponential function, $\exp (2\pi i \mathbf {x})$, with the magnitude kept as $1$. On the other hand, we used the images of the Fashion-MNIST dataset to generate the magnitude-phase images. To be specific, we directly used the pre-processed images $\mathbf {x} \in [0, 1]$ as the magnitudes, i.e., $\mathbf {x}_{mag} = \mathbf {x} \in [0, 1]$. We then set the phase parts through $\mathbf {x}_{phase} = \exp (2\pi i \mathbf {x}_{mag})$ and combined them by $\mathbf {x} = \mathbf {x}_{mag} \circ \mathbf {x}_{phase}$ to form the magnitude-phase training and testing images, where $\circ$ denotes elementwise multiplication. The resulting images therefore have linearly related magnitude and phase.
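The two object types can be generated with a few lines of NumPy; the sketch below simply restates the mappings above.

```python
import numpy as np

def to_phase_only(x):
    """Phase-only object from a grayscale image x in [0, 1]:
    unit magnitude and phase 2*pi*x (Sec. 3.1)."""
    return np.exp(2j * np.pi * x)

def to_magnitude_phase(x):
    """Magnitude-phase object with linearly related magnitude and phase:
    magnitude x and phase 2*pi*x (Sec. 3.1)."""
    return x * np.exp(2j * np.pi * x)
```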

As mentioned in Section 2, we reduce the saturation problem by convolving the Fourier-plane field with a defocus function. Based on the convolution property of the Fourier transform, we can simulate this by multiplying the testing image with a defocus kernel $\mathbf {h}$ in the spatial domain, which can be generated through Fresnel propagation [30]. Denote the original image as $x(p, q)$ and its Fourier transform as $X(u, v)$; the optical field $X_L(u', v')$ in the defocus plane is then expressed as:

$$X_L(u', v') = C \iint X(u, v) H(u - \frac{u'}{\lambda L}, v - \frac{v'}{\lambda L}) du\ dv,$$
where $\lambda$, $L$ and $H$ denote the wavelength, the distance between the lens after the object and the defocus plane, and the Fourier transform of the defocus kernel $\mathbf {h}$, respectively. $C$ is a constant and $h(p, q)$ is given as $e^{\frac {j\pi }{\lambda L}c(p^{2}+q^{2})}$. Based on the convolution theorem, Eq. (5) is equivalent to an elementwise multiplication between $h$ and $x$ in the spatial domain with the scaling factor $\lambda L$.

Then, an oversampled discrete Fourier transform (DFT) of size $762\times 762$ (the same size as in the real experiments) was performed to generate the intensity measurements. For each image, the Fourier intensity measurement was capped at the value $4095$ ($12$-bit) to simulate the dynamic range of practical scientific cameras. The central $128\times 128$ patch was cropped as the input of SiSPRNet (the performance with different input sizes is discussed in Section 3.2.1). The number of epochs and the mini-batch size were set to $1,000$ and $32$, respectively. The weight $\alpha$ of the TV loss was set to $1$. The Adam optimizer [39] with a learning rate of $10^{-4}$ was used to update the parameters in each training step. Note that the learning rate was updated by the StepLR scheduler in PyTorch every $100$ epochs with $\gamma = 0.9$.
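The simulated acquisition can be summarized by the NumPy sketch below: the object is multiplied by the spatial-domain defocus kernel, an oversampled $762\times 762$ DFT is taken, the intensity is capped at the $12$-bit limit, and the central $128\times 128$ patch is cropped. The defocus-kernel constant $c$, the coordinate scaling, and the exposure scale applied before capping are assumptions.

```python
import numpy as np

def defocus_kernel(n=128, wavelength=632.8e-9, L=0.03, c=1.0, pitch=8e-6):
    """Spatial-domain defocus kernel h(p, q) = exp(j*pi/(lambda*L)*c*(p^2+q^2))
    following Eq. (5); the constant c and the pixel pitch are assumptions."""
    coords = (np.arange(n) - n / 2) * pitch
    P, Q = np.meshgrid(coords, coords)
    return np.exp(1j * np.pi / (wavelength * L) * c * (P ** 2 + Q ** 2))

def simulate_measurement(obj, h, fft_size=762, scale=1.0, max_val=4095, crop=128):
    """Simulated acquisition (Sec. 3.1): defocus, oversampled DFT,
    12-bit saturation cap, and central crop for the network input."""
    field = obj * h                                    # elementwise defocus
    spectrum = np.fft.fftshift(np.fft.fft2(field, s=(fft_size, fft_size)))
    intensity = np.round(np.minimum(scale * np.abs(spectrum) ** 2, max_val))
    c0, half = fft_size // 2, crop // 2
    return intensity[c0 - half:c0 + half, c0 - half:c0 + half]
```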

We adopted the Peak Signal-to-Noise Ratio (PSNR, the higher the better), Structural Similarity (SSIM, the higher the better) and Mean Absolute Error (MAE, the lower the better) as the performance metrics to measure the difference between the reconstructed images $\hat {\mathbf {x}}$ and the ground-truth images $\mathbf {x}$. For each dataset, we selected the first $1,000$ images in the test set as the testing samples for evaluation. We converted the reconstructed real and imaginary parts to the magnitude and phase representation. The phase values of both the reconstructed and ground-truth images were shifted by $\pi$ to ensure non-negative values for the computation of PSNR. The average PSNR, SSIM and MAE over all $1,000$ testing images were used for evaluation.
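A sketch of the phase-part evaluation using scikit-image metrics is given below; the choice of data_range $= 2\pi$ for the PSNR and SSIM computations is our assumption.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def phase_metrics(x_hat, x_gt):
    """Evaluate a reconstructed complex image against the ground truth
    (Sec. 3.1): convert to phase, shift by pi so all values are
    non-negative, then compute PSNR, SSIM and MAE."""
    p_hat = np.angle(x_hat) + np.pi      # angles mapped to [0, 2*pi)
    p_gt = np.angle(x_gt) + np.pi
    psnr = peak_signal_noise_ratio(p_gt, p_hat, data_range=2 * np.pi)
    ssim = structural_similarity(p_gt, p_hat, data_range=2 * np.pi)
    mae = np.mean(np.abs(p_gt - p_hat))
    return psnr, ssim, mae
```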

3.2 Ablation study

To gain a deeper insight into the improvements obtained by SiSPRNet, we conducted several ablation studies on different components of our framework and analyzed their impact on the speed and accuracy of the network.

3.2.1 Impact of input size

For a two-dimensional Fourier spectrum, the central regions represent the low-frequency information while the outer areas contain the high-frequency components. Structured signals (e.g., natural images) usually have their energy concentrated in the low-frequency regions; hence the high-frequency components have relatively low intensity and are dominated by noise. An ablation analysis was performed to determine the size of the central region to be used. In our experiments, the intensity measurements have the size of $762 \times 762$. We fed the central $32 \times 32$, $64 \times 64$, $128 \times 128$, $256 \times 256$, and $512 \times 512$ regions of the intensity measurement to the same network and studied the performance, as shown in Fig. 4. The average testing PSNR from the $30,000$th to the $35,000$th training step was used for comparison. The box plot of the performance, the floating-point operations (FLOPs), and the number of trainable parameters with respect to the input sizes are shown in Fig. 4. Although the model performance keeps increasing with the input size, the number of trainable parameters also increases quickly, which brings a huge computational burden. For example, the number of parameters increases dramatically from $19.3M$ to $69.6M$ when the input size changes from $128 \times 128$ to $256 \times 256$. The performance grows with the input size when it is smaller than $256 \times 256$, but it starts to saturate as the input size reaches about $128 \times 128$. The reason why the performance of SiSPRNet cannot further improve with a larger input size is twofold. When selecting the central $128 \times 128$ pixels, we have already captured most of the significant data in the intensity measurement. When increasing the size to $256 \times 256$, some small-magnitude data are included, but they are much smaller than the data in the middle due to the large dynamic range of the intensity measurement. As deep neural networks work on statistics, these data with very small magnitudes can only slightly influence the overall performance; it is noted in Fig. 4 that only a slight improvement is achieved. If we further increase the size to $512 \times 512$, no improvement in performance is noted, since most of the additional data have very small magnitudes and are likely quantized to zero when converting the data to $12$-bit integers. Even if they are not quantized to zero, they will have a very low SNR due to the quantization error. Based on the above, we believe $128 \times 128$ is a good choice of input size to balance performance and efficiency.


Fig. 4. Ablation study on the input sizes. The results are obtained with the RAF dataset [37].


3.2.2 Impact of defocus distance

The effect of the defocus distance $L$ is discussed in this section. We investigated $6$ defocus distances: $0$ mm (no defocusing), $15$ mm, $20$ mm, $30$ mm, $45$ mm, and $75$ mm. The defocus kernel is a pure-phase object generated by the Holoeye built-in function. We used the same illumination strength and exposure time for all defocus distances. The size of the input intensity measurements is still $128 \times 128$. The box plot of the average PSNR value at each defocus distance is presented in Fig. 5. As shown in the figure, the reconstruction performance degrades significantly when there is no defocusing ($L = 0$ mm) because some pixels are severely saturated and give wrong information to the network. On the other hand, the performances are similar for $L = 15$ to $45$ mm, although the PSNR is slightly higher at $L = 30$ mm. It shows that the performance of the system is not sensitive to the choice of $L$ as long as it is neither too small nor too large. When the defocus distance is large, i.e., $75$ mm, the PSNR drops significantly because a portion of the Fourier intensity measurement (the high-frequency areas) becomes very small due to the decrease in light intensity over the long defocus distance. To balance the illumination efficiency and the reconstruction performance, the defocus distance $L$ was set to $30$ mm for all experiments.


Fig. 5. Ablation study on the defocus distance. The results are obtained with the RAF dataset [37].


3.2.3 Impact of Dropout layer

We investigated the effect of the number of Dropout layers in the feature extraction part. We set the dropout rate to $0.2$ in every Dropout layer. Since $3$ FC layers are used in the MLP block, we investigated $4$ situations: $0$ Dropout layers; $1$ Dropout layer after the first FC layer; $2$ Dropout layers, one after each of the first two FC layers; and $3$ Dropout layers, one after each FC layer. One Dropout layer is the default setting in our proposed SiSPRNet. The size of the input intensity measurements is $128 \times 128$. The box plot of the average PSNR value in each situation is presented in Fig. 6. As shown in the figure, if there is no Dropout layer, the average PSNR of the reconstructed images is the lowest because of overfitting: the network tends not to explore the hidden frequency representations, and the reconstruction relies on a small number of inputs. The Dropout layer randomly deactivates a part of the inputs at the training stage, which encourages the network to find more global and robust representations. At the testing stage, all input signals are used, forming an ensemble learning structure that increases the generalization power of the whole network. After introducing the Dropout layer, the model performance improves as expected. Interestingly, with more Dropout layers applied, too much information is ignored at the training stage, which hurts the learning progress of the network and decreases the performance. Consequently, only $1$ Dropout layer was adopted in the final model.


Fig. 6. Ablation study on the number of Dropout layers in feature extraction. The results are obtained with the RAF dataset [37].


3.2.4 Impact of fully connected layer

We investigated the impact of the size (number of neurons) of the FC layers of the MLP block on the overall performance. We investigated four situations: (i) each FC layer has $256$ ($16\times 16$) neurons; (ii) each FC layer has $1024$ ($32\times 32$) neurons; (iii) each FC layer has $4096$ ($64\times 64$) neurons; and (iv) each FC layer has $16384$ ($128\times 128$) neurons. The output in each case is reshaped to the corresponding two-dimensional form before being sent to the phase reconstruction part. In the phase reconstruction part, the $256$-neuron FC setting has $3$ UR blocks, while the other settings have $2$ UR blocks. The $4096$-neuron FC setting has only one up-sampling layer in one of the UR blocks, and the $16384$-neuron FC setting has no up-sampling layer. The average PSNR over the first $400$ training epochs is presented in Fig. 7. As shown in the figure, the convergence speeds of the $4096$-neuron and $16384$-neuron FCs are faster than those of the $1024$-neuron and $256$-neuron FCs. However, the $16384$-neuron, $4096$-neuron, and $1024$-neuron FCs achieve similar performance as training progresses. It suggests that a $1024$-neuron FC should be sufficient to extract the representative features of the required image. On the other hand, the model size quickly grows with the number of neurons. Therefore, we adopted the $1024$-neuron FC to balance performance and efficiency.


Fig. 7. Ablation study on the number of neurons of the FC layers. The results are obtained with the RAF dataset [37].


3.2.5 Impact of self-attention unit

We investigated the effect of the number of self-attention units in the phase reconstruction part. As shown in Fig. 2, there is a self-attention layer after each of the two convolution blocks in the UR block. We investigated $3$ situations: $0$ self-attention units, $1$ self-attention unit after the first convolution block, and $2$ self-attention units, one after each convolution block. The size of the input intensity measurements is still $128 \times 128$. The box plot of the average PSNR value in each situation is presented in Fig. 8. As shown in the figure, if there is no self-attention unit, the average PSNR of the reconstructed images is the lowest. It is because the convolution blocks can only extract the local correlation of the image but ignore the global correlation that is commonly found in structured images. The self-attention unit computes the cross-correlation between any two positions in an image and generates attention feature maps that strengthen the features of interest, which encourages the network to find more global and robust representations. After introducing the self-attention units, the model performance increases as expected. Following the result of this ablation study, we adopted $2$ self-attention units in the UR block of SiSPRNet.


Fig. 8. Ablation study on the number of self-attention units in phase reconstruction part. The results are obtained with the RAF dataset [37].


3.2.6 Impact of residual units

We also investigated the effect of the number of residual units in the phase reconstruction part. As shown in Fig. 2, each residual unit contains two Conv Blocks interlaced with two self-attention layers, plus a PReLU at the end. The size of the input images to the network is still $128 \times 128$. Keeping everything else the same and changing only the number of residual units in each UR block, the average PSNRs of the estimated images are obtained and shown in Fig. 9. It can be seen that using residual units brings better PSNR performance. The model with one residual unit has similar or even better performance than the models with more residual units. To pursue both PSNR and efficiency, we chose to use one residual unit in the UR block.


Fig. 9. Ablation study on the number of residual units. The results are obtained with the RAF dataset [37].


3.2.7 Cross-validation: model robustness

Besides the generalization ability, robustness is also a crucial metric for evaluating neural networks. The more robust the network, the less likely the overfitting problem is to occur. To demonstrate the robustness of the proposed SiSPRNet, we used cross-validation, a statistical method for evaluating learning-based approaches by dividing the whole dataset into two portions: one for training and the other for validation [40]. In this ablation study, we adopted K-fold cross-validation, the most popular cross-validation approach [41]. The procedure can be briefly described as follows (see also the code sketch below):
1. Randomly shuffle the dataset.
2. Split the dataset into $K$ groups.
3. In each fold, take one group as the testing dataset and the remaining groups as the training dataset. Then, fit the deep learning model on the training dataset and evaluate it on the testing dataset.
4. Reset the model parameters and repeat step 3 until $K$ folds are accomplished.
5. Summarize the model evaluation scores obtained in each fold.
To be specific, we adopted $K=5$ folds for evaluation and ran $100$ epochs in each fold. In each fold, $80\%$ of the samples were used as training data and the remaining $20\%$ for testing. We used the training (for the training dataset) and testing (for the testing dataset) losses as the evaluation scores. The training loss was recorded every $20$ training batches and the testing loss was collected every epoch. The simulated results on the RAF and Fashion-MNIST datasets are shown in Fig. 10. As can be seen, both losses decrease quickly with the training epochs for both datasets, and the loss curves are similar in each fold. Although there are small variations at the beginning of the training stage ($5$-$15$ epochs in Fig. 10(a) and $10$-$25$ epochs in Fig. 10(b)), all folds converge in the optimal direction and remain close to the average fold curve. In conclusion, the K-fold cross-validation results demonstrate the robustness and effectiveness of SiSPRNet. Note that cross-validation was only performed in this ablation study; the evaluation results in Sections 3.3 and 4.2 were obtained directly from the $1,000$ samples in the testing dataset.
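The protocol can be summarized by the following skeleton; train_fn and eval_fn are hypothetical placeholders for the actual training and evaluation routines, and samples is assumed to be an indexable array of training pairs.

```python
import numpy as np
from sklearn.model_selection import KFold

def run_kfold(samples, train_fn, eval_fn, k=5, seed=0):
    """Skeleton of the K-fold cross-validation protocol of Sec. 3.2.7:
    shuffle, split into k groups, train on k-1 groups and evaluate on the
    held-out group, resetting the model between folds."""
    scores = []
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(samples):
        model = train_fn(samples[train_idx])       # fresh model for each fold
        scores.append(eval_fn(model, samples[test_idx]))
    return float(np.mean(scores)), scores
```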


Fig. 10. Training & Testing errors of $5$-Fold cross validation on (a) RAF dataset, and (b) Fashion-MNIST dataset, respectively. Fold Avg. (colored in red) denotes the mean of $5$ folds.


3.3 Simulation results

In this paper, we compare our SiSPRNet with recent deep learning Fourier phase retrieval methods, including: the Plain Network (denoted as PlainNet) [12], a basic CNN structure that stacks convolutional layers one by one; the Residual Network (denoted as ResNet) [12,42], a popular deep learning structure with several skip connections among the convolutional layers; the residual dense network (denoted as ResDenseNet) [43], a network that adds more skip connections to achieve better efficiency; the lensless imaging network (denoted as LenslessNet) [19], a residual UNet that recovers phase-only images in a lensless imaging system; the Conditional Generative Adversarial Phase Retrieval Network (denoted as PRCGAN) [24], a network that utilizes the conditional adversarial learning strategy for the phase retrieval task; and NNPhase [25], a network that recovers complex-valued images from Fraunhofer diffraction patterns. As a reference, we also compare with a few traditional optimization-based algorithms, namely Gerchberg–Saxton (GS) [2], Hybrid Input-Output (HIO) [3], and total variation ADMM (ADMM-TV) [44].

All the deep learning methods were trained with the same settings on the RAF and Fashion-MNIST datasets. Each image in the dataset was converted to the size of $128\times 128$ and then zero-padded to the size of $762\times 762$. The 2D DFT was applied to generate the Fourier intensity images. Finally, the central $128\times 128$ region of the intensity image was used for the training and testing of all models. All trained models were taken at the last epoch of training. Note that the hyper-parameters of all compared deep learning methods were either set to the values defined in their original papers or optimally fine-tuned and fixed before training. For the traditional optimization-based methods, the full-frame $762\times 762$ intensity images were used as the input for evaluation. We ran three different trials for each testing image, where each trial had a maximum of $2000$ iterations, and the best result of the three trials was used as the final performance. For each reconstructed phase image, we removed the global phase offset by adding a global phase shift from $0$ to $2\pi$ to the image; the shifted image having the highest PSNR compared with the ground truth was used as the final result. Note that most existing deep learning-based phase retrieval methods can only accept very small intensity measurements. For example, PRCGAN and NNPhase accept intensity measurements of size $28\times 28$ and $64\times 64$ pixels, respectively, as shown in their original papers. We therefore need to convert the input $762\times 762$ pixel intensity measurement to sizes these networks can handle. Downsampling does not work, since it violates the oversampling requirement and substantially reduces the number of significant data points. Similar to the proposed SiSPRNet, we extract the central $128\times 128$ pixels of the intensity measurements as the input to these networks, which is easier for them to adapt to.
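The global phase-offset removal used when evaluating the optimization-based methods can be sketched as follows; the number of trial shifts and the use of MSE (equivalent to maximizing PSNR for a fixed data range) are assumptions.

```python
import numpy as np

def remove_global_phase(p_hat, p_gt, steps=360):
    """Sweep a global phase shift from 0 to 2*pi (Sec. 3.3), wrap the
    shifted phase, and keep the shift that maximizes PSNR (i.e.,
    minimizes MSE) against the ground-truth phase."""
    best_phase, best_mse = p_hat, np.inf
    for s in np.linspace(0.0, 2 * np.pi, steps, endpoint=False):
        cand = np.mod(p_hat + s, 2 * np.pi)
        mse = np.mean((cand - p_gt) ** 2)
        if mse < best_mse:
            best_mse, best_phase = mse, cand
    return best_phase
```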

Phase-only datasets: The phase retrieval results on the phase-only datasets are shown in Table 1, Table 2, and Fig. 11. To save space, the qualitative results of PlainNet and ResDenseNet are not included in Fig. 11 due to their inferior performance. As can be seen in Table 1 and Table 2, the proposed SiSPRNet achieves much better performance than all compared methods. Compared with other state-of-the-art deep learning methods, SiSPRNet achieves PSNR and SSIM gains of at least $1.631$ dB and $0.0656$, respectively, on the two datasets. It can be seen in Fig. 11 that the images reconstructed by the proposed SiSPRNet are also qualitatively better than those of the other deep learning methods. In addition, the error maps ($4$th column) show that the estimated images are close to the ground-truth images: the reconstructed images only differ from the original ones in some local areas with minor errors, while most regions are close to $0$. On the other hand, SiSPRNet also achieves significantly better performance than all compared traditional optimization-based methods, as shown in Table 1, Table 2, and Fig. 11.


Fig. 11. Simulation and experimental results of different phase retrieval methods on the RAF [37] and the Fashion-MNIST datasets [38]. The first column denotes the Fourier intensity measurements (pixel values: $0 \rightarrow 4095$). The second column shows the corresponding phase parts of the complex-valued images with scale bars (for experimental results) at the bottom left corners. The fourth column denotes the error maps of SiSPRNet. The other columns present the reconstructed images through different methods. Except for the Fourier intensity measurements (the first column), the colormap of the rest columns ranges from $0$ to $2\pi$. The first and second rows are the simulation and experimental results of the same image from the RAF dataset. The third and the fourth rows show the experimental performances of other images in the RAF dataset. The fifth and the sixth rows indicate the experimental results of images from the Fashion-MNIST dataset. Please zoom in for better view. More results can be found in the Appendix.



Table 1. Quantitative comparison (average MAE/PSNR/SSIM/inference time) with the state-of-the-art methods for phase retrieval on the RAF dataset. Best and second best performances are in red and blue colors, respectively.


Table 2. Quantitative comparison (average MAE/PSNR/SSIM/parameters/complexity) with the state-of-the-art methods for phase retrieval on the fashion-MNIST dataset. Best and second best performances are in red and blue colors, respectively.

Note that we trained all the networks with our quantized, defocused Fraunhofer diffraction pattern datasets. Although the images reconstructed by PRCGAN [24] seem to have better high-frequency information (facial details), they do not share similarities with the original images (second-to-last column). In fact, the performance of PRCGAN in our simulation is poorer than that reported in its original paper. It is because the dataset used in the original simulation of PRCGAN does not consider the saturation problem, which is commonly found when imaging the Fourier intensity. Besides, the authors assume the object images are magnitude-only (no phase information), which is far from real situations. As for NNPhase [26], it targets crystallographic data, so its application is quite specific. Besides, LenslessNet [19] was originally designed for the phase retrieval task based on Fresnel diffraction patterns in lensless imaging systems. When tested with the more general and realistic datasets used in our simulations, the performance of these networks drops significantly compared with those reported in their original papers. We will show in the next section that these networks also suffer from the same problems in the real experiments. The above simulation results show that the proposed SiSPRNet outperforms the traditional optimization-based and deep learning methods both quantitatively and qualitatively. More simulation results can be found in the Appendix.

We also evaluated the inference speed of different methods by collecting the mean execution time over one hundred samples. The results are presented in Table 1 (last column). Since our approach is designed in an end-to-end manner that can effectively utilize the huge computational power of current GPU devices, the proposed SiSPRNet is very fast and can achieve real-time performance ($3.569$ ms/img), comparable with the other learning-based approaches. In contrast, the inference speed of the traditional algorithms is quite slow (on the order of seconds), since many iterations are required and they can only be implemented on the CPU. In addition, we assessed the number of learnable parameters and the model complexity of all compared deep learning methods. The results are shown in Table 2. As can be seen, the proposed SiSPRNet has a moderate number of learnable parameters and a low model complexity (measured in giga floating-point operations, GFLOPs) among all approaches. Compared with the deep learning approaches having similar or more parameters, SiSPRNet has better reconstruction performance. It should be noted that the generator of PRCGAN contains more than $100M$ parameters but cannot give high accuracy, which might be the result of overfitting. In terms of complexity, SiSPRNet is lower than most of the networks but has much better performance.

Magnitude-phase dataset: The simulation results are shown in Fig. 12. We select HIO, NNPhase, and PRCGAN for comparison since they all perform Fourier phase retrieval. It can be seen that the proposed SiSPRNet outperforms all compared approaches. In particular, the results show that PRCGAN cannot deal with complex-valued phase retrieval tasks, and it is difficult for HIO to reconstruct complex-valued images from a single-shot intensity measurement. This is particularly the case because the magnitude surrounding the objects in these images is very small (very close to zero), which introduces ambiguities to the spatial constraint and affects the estimation process of HIO. The magnitude and phase images given by these approaches have poor quality. As presented in Fig. 12(b), the proposed SiSPRNet achieves much better performance than all compared methods under different metrics. For the magnitude part, the proposed SiSPRNet achieves average PSNR and SSIM gains of at least $2.145$ dB and $0.156$, respectively. For the phase part, the PSNR and SSIM gains reach $3.214$ dB and $0.181$, respectively. The above simulation results show that the proposed SiSPRNet can be generalized to complex-valued image datasets (containing linearly related magnitude and phase parts) and outperforms the existing methods.


Fig. 12. (a) Qualitative simulation results of different phase retrieval methods on the magnitude-phase Fashion-MNIST datasets [38]. The pixel values of the Fourier intensity measurements (first column) range from $0$ to $4095$. The colormap of the magnitude parts (second to the sixth columns) ranges from $0$ to $1$, while the colormap of the phase parts (seventh to the last columns) ranges from $0$ to $2\pi$. Please zoom in for better view. (b) Quantitative comparisons (average MAE/PSNR/SSIM) on the same dataset. Best performances are denoted in bold font.


4. Experimental results

4.1 Experimental setup

We also evaluated our proposed SiSPRNet on a practical optical system. A picture of the experimental setup is shown in Fig. 13. It comprises a Thorlabs $10$ mW HeNe laser with wavelength $\lambda = 632.8$ nm and a $12$-bit $1920\times 1200$ Kiralux CMOS camera with pixel pitch $5.86\ \mu m$. Besides, the optical system includes one $75$ mm, three $100$ mm, and one $150$ mm lenses to form a standard imaging system. The lenses were chosen to generate Fourier intensity images of appropriate size to match the size of the CMOS sensor. In practice, these lenses can be compactly arranged so that the whole imaging system can be made very small; however, the lens design is outside the scope of this study. A $1920\times 1080$ Holoeye Pluto phase-only SLM with pixel pitch $\delta _{SLM}=8\ \mu m$ was used to generate the testing objects in the experiments. Specifically, the images in the two datasets mentioned above were loaded onto the SLM to impose multiple-level phase changes. The images were pre-multiplied with a defocus kernel; thus, there was no need to move the camera beyond the focal plane. This ensures that the defocusing process, which is not our main focus, is perfectly implemented, so that we only need to focus on the reconstruction performance of the different methods. The defocus kernel is a pure-phase object generated by the Holoeye built-in function corresponding to $L = 30$ mm; the reason for choosing $L = 30$ mm is explained in Section 3.2.2. We used a square stop to crop the central $128\times 128$ SLM pixels in all experiments. The resulting Fourier intensity measurements were captured by the camera and formed the training and testing datasets for SiSPRNet. The size of the acquired measurement is $\frac {\lambda L}{d\delta _{SLM}} = 762 \times 762$, where $d$ denotes the pixel pitch of the camera. Thus, the scaled measurement is equivalent to the Fourier intensity of the zero-padded images. Similar to the simulation, the central $128\times 128$ regions of the intensity images were used as the training and testing samples for all deep learning methods. For the optimization-based methods, the full $762\times 762$ intensity images were used as the input.


Fig. 13. Experimental setup of the defocus phase retrieval system.


4.2 Experimental results

With the training dataset, we fine-tuned all the pre-trained models acquired in the simulation with the experimental measurements. That is, we re-trained the models, initialized with the parameters trained in the simulation, using the experimental measurements as the new training inputs. We used the same training and testing datasets (RAF and Fashion-MNIST) as in the simulation. The number of epochs of the re-training procedure was set to $150$. The quantitative and qualitative experimental results are presented in Table 1, Table 2, and Fig. 11.

Table 1 and Table 2 show that the experimental results of SiSPRNet decline only a little compared with the simulation results, which demonstrates the robustness of the proposed method. Compared with other state-of-the-art deep learning methods, SiSPRNet achieves at least $1.773$ dB and $0.067$ gains in PSNR and SSIM, respectively, on the two datasets. As can be clearly seen in Fig. 11, the proposed SiSPRNet reconstructs contours most similar to the ground truth and preserves more details than the other methods. The above experimental results show that the proposed SiSPRNet outperforms all deep learning methods quantitatively and qualitatively when used in a practical phase retrieval system. On the other hand, SiSPRNet also improves significantly over the compared traditional optimization-based methods. However, there is still room for improvement. The results of all learning-based approaches show that the reconstructed images are a bit blurry compared with the ground truth. We note that for the captured intensity images, the values of the high-frequency components are much smaller than those in the low-frequency areas, and the noise in the images further makes the high-frequency components difficult to identify. This makes it difficult for the neural networks to extract the features of the high-frequency components. Further investigation is needed to improve the sharpness of the reconstructed images; extra denoising and attention modules may be applied to better recover the high-frequency components.

5. Conclusion

This paper proposed an end-to-end deep neural network structure for single-shot maskless phase retrieval. We explained how we constructed our Fourier phase retrieval optical path to ensure the feasibility of the solution and to reduce the saturation problem due to the high dynamic range of Fourier intensity measurements. We proposed a novel deep neural network structure named SiSPRNet to reconstruct phase images using the central areas of the intensity measurements. To fully extract the representative features, we proposed a new feature extraction unit using an MLP as the front end. It was tailor-designed for extracting the representative features of the object and improving the training efficiency. Besides, a self-attention mechanism under a residual learning structure was included to enhance the global correlation in the reconstructed phase images. We evaluated the proposed SiSPRNet both by computer simulation and on an optical experimentation platform (with defocusing to mitigate the saturation problem) and compared it with a number of existing deep neural network approaches. From the simulation and experimental results presented, we can conclude that the proposed SiSPRNet consistently outperforms other deep neural network structures in reconstructing phase-only images and images with linearly related magnitude and phase. It is a promising solution for practical phase retrieval applications. At the moment, we have only conducted simulations and experiments on synthesized objects. Further research is underway to enhance the network structure for working with more complex image datasets and realistic objects.

Appendix

In this Appendix, we show additional qualitative comparison results. Figures 14, 15, and 16 present the simulation and experimental results. Detailed explanations are given in the caption of each figure.

Fig. 14. Simulation results of different phase retrieval methods on the RAF dataset [37]. The first column denotes the Fourier intensity measurements (pixel values: $0 \rightarrow 4095$). The second column shows the corresponding phase parts of the complex-valued images. The fourth column denotes the error maps of SiSPRNet. The other columns present the reconstructed images obtained by the different methods. Except for the Fourier intensity measurements (the first column), the colormap of the remaining columns ranges from $0$ to $2\pi$. Each row presents the simulation results of a testing image. Please zoom in for a better view.

Fig. 15. Experimental results of different phase retrieval methods on the RAF dataset [37]. The first column denotes the Fourier intensity measurements (pixel values: $0 \rightarrow 4095$). The second column shows the corresponding phase parts of the complex-valued images with scale bars at the bottom left corners. The fourth column denotes the error maps of SiSPRNet. The other columns present the reconstructed images obtained by the different methods. Except for the Fourier intensity measurements (the first column), the colormap of the remaining columns ranges from $0$ to $2\pi$. Each row presents the experimental results of a testing image. Please zoom in for a better view.

Fig. 16. Simulation and experimental results of different phase retrieval methods on the Fashion-MNIST dataset [38]. The first column denotes the Fourier intensity measurements (pixel values: $0 \rightarrow 4095$). The second column shows the corresponding phase parts of the complex-valued images with scale bars (for experimental results) at the bottom left corners. The fourth column denotes the error maps of SiSPRNet. The other columns present the reconstructed images obtained by the different methods. Except for the Fourier intensity measurements (the first column), the colormap of the remaining columns ranges from $0$ to $2\pi$. The first and second rows present the simulation and experimental results of the same image. The third and fourth rows show the simulation and experimental results of another image. Please zoom in for a better view.

Funding

Hong Kong Research Grant Council (PolyU 15225321); Centre for Advances in Reliability and Safety (CAiRS) (Project 5.2).

Acknowledgments

This work was supported by the Hong Kong Research Grant Council under General Research Fund no. PolyU 15225321, and the Centre for Advances in Reliability and Safety (CAiRS) admitted under AIR@InnoHK Research Cluster.

Disclosures

The authors declare no conflicts of interest.

Data availability

The datasets used in this paper are available in [37,38]. Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Q. Ye, L.-W. Wang, and D. P.-K. Lun, “SiSPRNet: End-to-End Learning for Single-Shot Phase Retrieval,” https://github.com/Qiustander/SiPRNet (2022).

2. R. W. Gerchberg, “A practical algorithm for the determination of phase from image and diffraction plane pictures,” Optik 35, 237–246 (1972).

3. J. R. Fienup, “Phase retrieval algorithms: a comparison,” Appl. Opt. 21(15), 2758–2769 (1982). [CrossRef]  

4. J. M. Rodenburg, “Ptychography and related diffractive imaging methods,” Adv. Imaging Electron Phys. 150, 87–184 (2008). [CrossRef]  

5. A. Anand, G. Pedrini, W. Osten, and P. Almoro, “Wavefront sensing with random amplitude mask and phase retrieval,” Opt. Lett. 32(11), 1584–1586 (2007). [CrossRef]  

6. R. Horisaki, Y. Ogura, M. Aino, and J. Tanida, “Single-shot phase imaging with a coded aperture,” Opt. Lett. 39(22), 6466 (2014). [CrossRef]  

7. E. J. Candes, X. Li, and M. Soltanolkotabi, “Phase retrieval via wirtinger flow: Theory and algorithms,” IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015). [CrossRef]  

8. Q. Ye, Y.-H. Chan, M. G. Somekh, and D. P. Lun, “Robust phase retrieval with green noise binary masks,” Opt. Lasers Eng. 149, 106808 (2022). [CrossRef]

9. C. Zheng, R. Zhou, C. Kuang, G. Zhao, Z. Yaqoob, and P. So, “Digital micromirror device-based common-path quantitative phase imaging,” Opt. Lett. 42(7), 1448–1451 (2017). [CrossRef]  

10. E. J. Candes, X. Li, and M. Soltanolkotabi, “Phase retrieval from coded diffraction patterns,” Appl. Comput. Harmon. Anal. 39(2), 277–299 (2015). [CrossRef]

11. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

12. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016).

13. K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising,” IEEE Trans. on Image Process. 26(7), 3142–3155 (2017). [CrossRef]

14. L.-W. Wang, Z.-S. Liu, W.-C. Siu, and D. P. Lun, “Lightening network for low-light image enhancement,” IEEE Trans. on Image Process. 29, 7984–7996 (2020). [CrossRef]  

15. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds. (Curran Associates, Inc., 2017).

16. M. Deng, S. Li, A. Goy, I. Kang, and G. Barbastathis, “Learning to synthesize: robust phase retrieval at low photon counts,” Light: Sci. Appl. 9(1), 36 (2020). [CrossRef]  

17. J. White and Z. Chang, “Attosecond streaking phase retrieval with neural network,” Opt. Express 27(4), 4799–4807 (2019). [CrossRef]  

18. J. Shi, X. Zhu, H. Wang, L. Song, and Q. Guo, “Label enhanced and patch based deep learning for phase retrieval from single frame fringe pattern in fringe projection 3D measurement,” Opt. Express 27(20), 28929–28943 (2019). [CrossRef]

19. A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica 4(9), 1117–1125 (2017). [CrossRef]  

20. C. Bai, M. Zhou, J. Min, S. Dang, X. Yu, P. Zhang, T. Peng, and B. Yao, “Robust contrast-transfer-function phase retrieval via flexible deep learning networks,” Opt. Lett. 44(21), 5141–5144 (2019). [CrossRef]  

21. S. Kumar, “Phase retrieval with physics informed zero-shot network,” Opt. Lett. 46(23), 5942–5945 (2021). [CrossRef]  

22. T. Uelwer, T. Hoffmann, and S. Harmeling, “Non-iterative phase retrieval with cascaded neural networks,” in Artificial Neural Networks and Machine Learning – ICANN 2021, I. Farkaš, P. Masulli, S. Otte, and S. Wermter, eds. (Springer International Publishing, Cham, 2021), pp. 295–306.

23. M. Mirza and S. Osindero, “Conditional generative adversarial nets,” (2014).

24. T. Uelwer, A. Oberstraß, and S. Harmeling, “Phase retrieval using conditional generative adversarial networks,” in 2020 25th International Conference on Pattern Recognition (ICPR), (IEEE, 2021), pp. 731–738.

25. L. Wu, P. Juhas, S. Yoo, and I. Robinson, “Complex imaging of phase domains by deep neural networks,” IUCrJ 8(1), 12–21 (2021). [CrossRef]  

26. L. Wu, S. Yoo, A. F. Suzana, T. A. Assefa, J. Diao, R. J. Harder, W. Cha, and I. K. Robinson, “Three-dimensional coherent X-ray diffraction imaging via deep convolutional neural networks,” npj Comput. Mater. 7(1), 175 (2021). [CrossRef]

27. F. Fus, Y. Yang, A. Pacureanu, S. Bohic, and P. Cloetens, “Unsupervised solution for in-line holography phase retrieval using bayesian inference,” Opt. Express 26(25), 32847–32865 (2018). [CrossRef]  

28. Y. Zhang, M. A. Noack, P. Vagovic, K. Fezzaa, F. Garcia-Moreno, T. Ritschel, and P. Villanueva-Perez, “Phasegan: a deep-learning phase-retrieval approach for unpaired datasets,” Opt. Express 29(13), 19593–19604 (2021). [CrossRef]  

29. M. Hayes, “The reconstruction of a multidimensional sequence from the phase or magnitude of its Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process. 30(2), 140–154 (1982). [CrossRef]

30. J. Goodman, Introduction to Fourier Optics (W. H. Freeman, 2017).

31. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15, 1929–1958 (2014). [CrossRef]  

32. X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018), pp. 7794–7803.

33. Y. Wang, Z. Lin, H. Wang, C. Hu, H. Yang, and M. Gu, “High-generalization deep sparse pattern reconstruction: feature extraction of speckles using self-attention armed convolutional neural networks,” Opt. Express 29(22), 35702–35711 (2021). [CrossRef]  

34. F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017).

35. X. Zhang, L. Han, W. Zhu, L. Sun, and D. Zhang, “An explainable 3D residual self-attention deep neural network for joint atrophy localization and Alzheimer’s disease diagnosis using structural MRI,” IEEE J. Biomed. Health Inform. (2021).

36. K. Park, J. W. Soh, and N. I. Cho, “Dynamic residual self-attention network for lightweight single image super-resolution,” IEEE Trans. Multimedia (2021).

37. S. Li, W. Deng, and J. Du, “Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2017), pp. 2584–2593.

38. H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” (2017).

39. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” (2017).

40. P. Refaeilzadeh, L. Tang, and H. Liu, Cross-Validation (Springer New York, New York, NY, 2016), pp. 1–7.

41. M. Stone, “Cross-validatory choice and assessment of statistical predictions,” J. Royal Stat. Soc. Ser. B (Methodological) 36(2), 111–133 (1974). [CrossRef]  

42. Y. Nishizaki, R. Horisaki, K. Kitaguchi, M. Saito, and J. Tanida, “Analysis of non-iterative phase retrieval based on machine learning,” Opt. Rev. 27(1), 136–141 (2020). [CrossRef]  

43. Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2018).

44. H. Chang, Y. Lou, Y. Duan, and S. Marchesini, “Total variation–based phase retrieval for Poisson noise removal,” SIAM J. Imaging Sci. 11(1), 24–55 (2018). [CrossRef]
