
Surpassing the diffraction limit using an external aperture modulation subsystem and related deep learning method

Open Access

Abstract

The resolution of conventional imaging systems is inherently restricted by the diffraction limit. To surpass this diffraction barrier, a scheme using an external aperture modulation subsystem (EAMS) and a related deep learning network (DLN) is presented in this paper. The EAMS facilitates the realization of various image acquisition strategies and related DLN architectures. In the specific scenario of a 3-aperture modulation strategy, the capabilities of this approach are validated both in numerical simulations and in experiments. The results show that both the resolution enhancement ability and the image fidelity can be improved by adding just one set of label data. The framework proposed here provides a more general way to further explore the ability of DLN-based methods to surpass the diffraction limit, and permits rapid data acquisition that opens new opportunities for training data collection and for further super-resolution imaging of label-free moving objects, such as living cells.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

The resolution of conventional imaging systems is inherently restricted by the Abbe-Rayleigh diffraction limit, $\Delta = 0.61\lambda /NA$, which relates the minimal spatially resolvable spacing $\Delta$ to the optical wavelength $\lambda$ and the numerical aperture $NA$ of the imaging system [1]. To surpass this diffraction barrier, termed super-resolution (SR) in this paper, different methods have been researched in the past few decades, such as stimulated emission depletion microscopy [2], scanning near-field optical microscopy [3], photoactivated localization microscopy [4], stochastic optical reconstruction microscopy (STORM) [5], structured illumination microscopy [6] and so on.

As an alternative solution for optical SR imaging, data-driven approaches using deep learning networks (DLNs) have been reported recently with high empirical success [7,8,9]. Different from methods using iterative optimization algorithms [10,11,12], DLN-based SR methods offer a well-trained, non-iterative reconstruction tool that rapidly performs resolution enhancement, without the need to estimate a point spread function (PSF) or numerically model the imaging process [8]. To the best of our knowledge, a DLN-based SR method was first proposed by Rivenson et al. [13] for wide-field microscopic imaging. An “end-to-end” strategy is used for training the DLN, and the specimens are imaged twice separately: once with a 40×/0.95NA objective lens to acquire the low-resolution (LR) input data, and again with a 100×/1.4NA oil immersion objective lens to acquire the high-resolution (HR) label data. A resolution enhancement of about 1.28 times is achieved over a larger field of view (FOV) and depth of field. DLN-based methods have since been demonstrated to improve the resolution of fluorescence microscopy [14,15], coherent imaging [16] and STORM imaging [17,18], among others. However, acquiring images by switching objective lenses hinders the possibility of obtaining training sets of moving objects. The LR and HR images of different regions of the same specimens are acquired separately, leading to a quite long time interval between them. In addition, the NA and the magnification of the different objective lenses must be matched when building the DLN architecture, which limits the flexible realization of various DLN architectures and further affects the achievable SR performance. If more input data or label data are wanted, i.e., if more objective lenses are used, these problems become even more complicated.

To address these issues, an external aperture modulation subsystem (EAMS) and a related DLN method are proposed in this paper. The EAMS facilitates the realization of various image acquisition strategies and related DLN architectures. With a 3-aperture modulation strategy, better SR performance can be achieved compared with the commonly used 2-aperture modulation case. In addition to wide-field microscopy, this framework might be combined with other existing and future imaging systems to further improve imaging performance and broaden the scope of application, with a compact setup and low cost.

2. Methods

2.1 Experimental setups

As depicted in Fig. 1(a), the experimental setup contains a commercial wide-field microscope and the EAMS. The EAMS integrates a collimating lens L1, a variable iris, an imaging lens L2 and a camera. In the experiment, the image plane of the microscope is set as the front focal plane of the collimating lens L1 ($f_1 = 400~\textrm{mm}$), and then re-imaged by the imaging lens L2 ($f_2 = 200~\textrm{mm}$) onto a camera (FLIR BFS-U3-63S4M-C, $3072 \times 2048$) with a pixel size of $2.4~\mathrm{\mu m}$. A motorized iris diaphragm (MID, SmarAct GmbH, SID-18) is applied to modulate the aperture size of the imaging system, as shown in Fig. 1(b). This diaphragm can be placed at any pupil plane of the system [19], and it is set at the pupil plane between the lenses L1 and L2 in our experiment. The wide-field microscopic images are acquired using a 20×/0.4NA objective lens, and the illumination light is filtered to a narrow spectral band centered at $\sim 530~\textrm{nm}$ for better performance. The biological samples used in the experiment are longitudinal sections of maize seed, with a thickness of $5~\mathrm{\mu m}$ and stained with hematoxylin, as shown in Fig. 1(c).

Fig. 1. The experimental setups. (a) Schematics of the imaging system and the EAMS. (b) The motorized iris diaphragms. (c) The longitudinal sections of maize seed used in our experiments. A ruler is put aside as a reference.

The magnification of the whole imaging system is 10×. To mitigate the influence of aberrations introduced by the lenses L1 and L2 and to acquire experimental SR images for comparison, the aperture at the pupil plane is not fully opened. With the maximal aperture size of $D_{\max} = 7.5~\textrm{mm}$ used in our experiments, the diffraction-limited resolution and the “effective” pixel size are calibrated to be $\sim 1.25~\mathrm{\mu m}$ and $\sim 0.24~\mathrm{\mu m}$, using a microscopic high-resolution chart (Edmund, High Resolution Microscopy Star Target #37-538, 7.5–3300 lp/mm). By using aberration-corrected lenses L1 and L2, the EAMS can fully achieve the diffraction-limited resolution of the original imaging system, i.e., $\sim 0.8~\mathrm{\mu m}$. Note that only one objective lens is used in our experiment. For the other objective lenses of the microscope, we treat each as a new imaging system, and a fresh implementation of the whole framework is suggested.

Compared to the image acquisition method based on switching objective lenses [13–16], the EAMS used in this paper can achieve fast and consecutive aperture modulation. As a result, various DLN architectures can be implemented with different input and output/label combinations, such as one-to-one, one-to-many, many-to-one or many-to-many. In this paper, a 3-aperture modulation strategy with a single-input, double-label DLN architecture is taken as an example. The single input data are the LR images acquired with the aperture size $D_{LR} = 2.5~\textrm{mm}$, and the double label data are the moderate-resolution images and the HR images acquired with the aperture sizes $D_{MR} = 5~\textrm{mm}$ and $D_{HR} = 7.5~\textrm{mm}$, respectively. All images are automatically acquired and adequately sampled.

2.2 Deep learning network architecture

With the aperture modulation strategy mentioned above, a DLN architecture backboned with deep pyramidal cascaded channel attention residual transmitting (CART) blocks is proposed, named dpcCARTs-Net for short. As schematically demonstrated in Fig. 2, the dpcCARTs-Net consists of three integral components, i.e., primitive feature extraction, deep pyramidal cascaded CART blocks, and residual reconstruction. Assume the LR input, the HR label and its corresponding output are represented by $x$, $y_1$ and $\hat{y}_1$, respectively. The moderate-resolution label $y_2$ and its corresponding output $\hat{y}_2$ are an optional branch (plotted with dashed connecting lines in Fig. 2), and the 3-aperture modulation strategy can be conveniently transformed into the commonly used 2-aperture modulation strategy simply by disconnecting the moderate-resolution label $y_2$ and its output $\hat{y}_2$. Note that $x$, $y_1$ and $y_2$ are the pre-processed versions of the corresponding images, used for fast convergence during the training process.

Fig. 2. The schematic diagram of the detailed network architecture of the dpcCARTs-Net. The element-wise addition operation in the CART block serves as a short skip connection, and the one from the network input to the output serves as a long skip connection. The element-wise multiplication operation is fulfilled using a broadcast mechanism provided by the deep learning platform. All the input/label images are pseudo-color images and normalized to 1.

To formally illustrate the model implementation, let $\mathrm{C}$ denote a convolutional layer. Here, we use only one convolutional layer to extract the primitive features $f_0$ as

$$f_0 = \mathrm{C}(x).$$

Then, $f_0$ is passed to the deep pyramidal cascaded CART blocks component, which includes $N$ CART blocks with gradually increasing feature channels. The $k$-th CART block is denoted $\mathrm{H}_{\mathrm{cart}\_k}$ for simplicity, and the extracted deep features $f_{deep}$ are obtained as

$$f_{deep} = \mathrm{H}_{\mathrm{cart}\_N}(\cdots(\mathrm{H}_{\mathrm{cart}\_k}(\cdots(\mathrm{H}_{\mathrm{cart}\_1}(f_0))))).$$

Following that, $f_{deep}$ is passed to the residual reconstruction component, which is also composed of only one convolutional layer, to predict the residuals between the network input and output. Finally, the output of the dpcCARTs-Net can be expressed as

$$\hat{y} = \mathrm{C}(f_{deep}) + x.$$

CART block: In the previous study [13], the channel-wise features are treated equally, which may not be suitable for SR imaging applications. To focus on more informative high-frequency features, a channel attention mechanism [20] is introduced into the dpcCARTs-Net to weight the residual features of different frequency bands, as illustrated at the bottom of Fig. 2. The input of each CART block is first passed through a convolutional layer embedded with a rectified linear unit (ReLU) activation $\Gamma$ to obtain feature maps $F = [F_1, F_2, \ldots, F_c]$, which have $c$ feature channels of size $h \times w$. Let $F$ be the input of the channel attention unit; the channel-wise statistic $z \in {\mathbb R}^c$ is then obtained by shrinking through the spatial dimensions $h \times w$ using a global average pooling operation $\mathrm{H}_{\mathrm{GP}}({\bullet})$. The $k$-th element of the channel descriptor $z$ is determined by

$$z_k = \mathrm{H}_{\mathrm{GP}}(F_k) = \frac{1}{h \times w}\sum\limits_{i = 1}^h \sum\limits_{j = 1}^w F_k(i,j),$$
where $F_k(i,j)$ represents the value at position $(i,j)$ of the $k$-th feature map $F_k$. Next, a simple gating mechanism with a sigmoid function $\sigma$ is used to fully capture the channel-wise dependencies from the aggregated information [20]. As the number of feature channels $c$ gradually increases in each CART block of the dpcCARTs-Net, the channel number in the middle of the channel attention unit is determined by $\Lambda(c/r)$, where $r$ is a constant reduction ratio and $\Lambda({\bullet})$ returns the smallest integer not less than its argument. Then we obtain the final channel statistics
$${s_{1 \times 1 \times c}} = \sigma ({W_U}\Gamma ({W_D}{z_{1 \times 1 \times c}})),$$
where ${W_D}$ and ${W_U}$ are the weights of the channel-downscaling and channel-upscaling convolution operations with reduction ratio $r$. The re-weighted feature channels $\hat{F}$ can be obtained using the broadcasting mechanism in Keras,
$${\hat{F}_c} = {F_c} \times {s_{1 \times 1 \times c}}.$$
$\hat{F}$ is then passed through the second convolutional layer embedded with a ReLU, and a short skip connection is formed by adding the input of the CART block to its output. The operations described above help the network adaptively and progressively learn high-frequency information features across a wide range as the CART block index increases, which improves the resolution and the image fidelity simultaneously.
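For illustration, a minimal sketch of one CART block in the Keras functional API is given below. The function name cart_block, the use of Conv2D layers for $W_D$ and $W_U$, and the 1 × 1 convolution used to match channel counts across the short skip connection are illustrative assumptions following the description above, not the released implementation.

import math
import tensorflow as tf
from tensorflow.keras import layers

def cart_block(x_in, channels, r=8):
    """One channel attention residual transmitting (CART) block (sketch)."""
    # First 3x3 convolution embedded with a ReLU activation.
    f = layers.Conv2D(channels, 3, padding='same', activation='relu')(x_in)
    # Channel attention unit: global average pooling -> 1x1 conv down/up -> sigmoid.
    z = layers.GlobalAveragePooling2D()(f)              # channel-wise statistic z
    z = layers.Reshape((1, 1, channels))(z)
    mid = math.ceil(channels / r)                       # Lambda(c/r): ceiling of c/r
    s = layers.Conv2D(mid, 1, activation='relu')(z)     # channel-downscaling, W_D
    s = layers.Conv2D(channels, 1, activation='sigmoid')(s)  # channel-upscaling, W_U
    f_hat = layers.Multiply()([f, s])                   # broadcast re-weighting of channels
    # Second 3x3 convolution embedded with a ReLU.
    f_out = layers.Conv2D(channels, 3, padding='same', activation='relu')(f_hat)
    # Short skip connection; a 1x1 "augmenting" conv matches channel counts if they differ.
    if x_in.shape[-1] != channels:
        x_in = layers.Conv2D(channels, 1, padding='same')(x_in)
    return layers.Add()([x_in, f_out])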

The channel increase formula of the CART block and the short skip connection are the same as those in [13]. When the total number $N$ of residual blocks is the same, the network depth of the dpcCARTs-Net does not increase significantly, but the dpcCARTs-Net provides a wider receptive field, which has the potential to yield better SR performance. Besides, a long skip connection is introduced to offer a clear path for the low-frequency information flow, which may empower the dpcCARTs-Net to learn more informative high-frequency residuals and, at the same time, stabilize the training of a very deep network.

Loss (Cost) function: The dpcCARTs-Net is trained by minimizing the loss function expressed as

$$L = {L^{MSE}} + {\lambda _1} \cdot {L^{MAE}} + {\lambda _2} \cdot (1 - {L^{SSIM}}),$$
${L^{MSE}}$ is the mean-squared error (MSE, i.e., L2-norm) serving as a data-fidelity term. ${L^{MAE}}$ is the mean absolute error (MAE, i.e., L1-norm) serving as a regularizer to promote the weights' sparsity and the objects' spatial sparsity simultaneously [7]. Since the L2- and L1-norms are both pixel-wise, $(1 - {L^{SSIM}})$, a transformation of the structural similarity image measure (SSIM) index [21], is introduced as another regularizer. The SSIM takes into account the correlations of adjacent pixels, which are prevalent in natural objects, and encourages the network output to look natural [8]. ${\lambda _1}$ and ${\lambda _2}$ are the regularization parameters.
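A minimal sketch of this composite loss in TensorFlow/Keras is given below; the use of tf.image.ssim with max_val = 1 (all images are normalized to 1) and the default SSIM window are assumptions consistent with the pre-processing described above, not a statement of the exact implementation.

import tensorflow as tf

def composite_loss(lambda_1=0.001, lambda_2=0.001):
    """L = MSE + lambda_1 * MAE + lambda_2 * (1 - SSIM), as in the loss above (sketch)."""
    def loss(y_true, y_pred):
        mse = tf.reduce_mean(tf.square(y_true - y_pred))
        mae = tf.reduce_mean(tf.abs(y_true - y_pred))
        # Images are normalized to 1, so max_val = 1; tf.image.ssim returns one value per image.
        ssim = tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))
        return mse + lambda_1 * mae + lambda_2 * (1.0 - ssim)
    return loss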

Implementation details: The implementation details for training the dpcCARTs-Net are given as follows. The feature channel number of the primitive feature extraction convolutional layer is set as 32, and that of the residual reconstruction layer is either one for grayscale images or three for RGB images. The total number of CART blocks is $N = 8$, and the moderate-resolution label and the HR label correspond to the 5th and 8th CART blocks, respectively. The kernel size is set as 3 × 3 in all convolutional layers except those in the channel attention unit and the augmenting convolutional layer, whose kernel size is 1 × 1. Zero padding is used to keep the size of the feature channels the same. The reduction ratio in the channel attention unit is chosen as $r = 8$. Note that the EAMS only changes the numerical aperture of the optical imaging system, keeping its FOV and magnification the same. As a result, images with different resolutions occupy the same pixels in our method, and the follow-up image registration operation in [13] is no longer needed before the training process.
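Putting these pieces together, the sketch below assembles a dpcCARTs-Net-like model from the cart_block and composite_loss sketches above. The exact channel-increase schedule of [13] is not reproduced here; the linear increase of 8 channels per block and the way the moderate-resolution output branches directly from the 5th block are illustrative assumptions only.

from tensorflow.keras import layers, Model

def build_dpccarts_net(h=64, w=64, n_blocks=8, mid_block=5, base_channels=32):
    """Assemble a dpcCARTs-Net-like model: primitive feature extraction, N CART blocks,
    and residual reconstruction, with an optional moderate-resolution output (sketch)."""
    x = layers.Input(shape=(h, w, 1))
    f = layers.Conv2D(base_channels, 3, padding='same')(x)   # primitive features f_0
    feat, y2_hat = f, None
    for k in range(1, n_blocks + 1):
        channels = base_channels + 8 * (k - 1)    # assumed gradual channel increase (illustrative)
        feat = cart_block(feat, channels)
        if k == mid_block:                        # moderate-resolution branch at the 5th block
            y2_hat = layers.Add(name='y2_hat')([layers.Conv2D(1, 3, padding='same')(feat), x])
    # Residual reconstruction plus the long skip connection from the network input.
    y1_hat = layers.Add(name='y1_hat')([layers.Conv2D(1, 3, padding='same')(feat), x])
    return Model(x, [y1_hat, y2_hat])   # drop y2_hat for the 2-aperture strategy

# model = build_dpccarts_net()
# model.compile(optimizer='adam', loss=composite_loss(), loss_weights=[1.0, 1.0])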

Noise in experimentally acquired images will affect the training and inference processes of the DLN. In addition to image patch extraction with overlapping, image patch normalization, data augmentation and the pre-processing required by the deep learning platform, i.e., Keras (v2.2.4-tf) with a TensorFlow (v2.1.1) backend, a simple pre-filtering denoising operation is applied to all the images, which can be expressed as

$$g_f = |\mathcal{F}^{-1}(\mathcal{F}(g) \times P_{cutoff})|,$$
where $\mathcal{F}$ and $\mathcal{F}^{-1}$ are the Fourier transform and its inverse. $P_{cutoff}$ is a circular low-pass filter whose radius is the theoretical cutoff frequency of the aperture with which the image $g$ is acquired. Images with different resolutions therefore have their own specific pre-filtering operations, but in all cases the pre-filtering operation only blocks the spurious higher frequency components introduced by noise and has no influence on the inherent frequency components of the image. All the dpcCARTs-Nets in this paper are trained for 100 epochs using an Adam optimizer with the default parameters of the deep learning platform, and the values of ${\lambda _1}$ and ${\lambda _2}$ are optimized and chosen empirically. The batch size is set as 32 for 64 × 64 pixel sub-images. The training and validation processes are implemented on a standard workstation equipped with 64 GB RAM, an Intel Xeon Gold 5118 2.30 GHz CPU, and an NVIDIA Quadro P2000 GPU with 5 GB memory.
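A minimal NumPy sketch of this pre-filtering operation is given below; the cutoff radius is passed in pixels, and how it is derived from the theoretical cutoff frequency of each aperture is left to the caller, as an assumption about the parameterization.

import numpy as np

def prefilter_denoise(g, cutoff_radius_px):
    """Low-pass pre-filter: block frequency components beyond the theoretical cutoff."""
    h, w = g.shape
    G = np.fft.fftshift(np.fft.fft2(g))
    # Circular low-pass mask P_cutoff with the given cutoff radius (in pixels).
    yy, xx = np.mgrid[:h, :w]
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= cutoff_radius_px ** 2
    return np.abs(np.fft.ifft2(np.fft.ifftshift(G * mask)))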

2.3 Super-resolution extrapolation process

Unlike previous studies that use the LR images as inputs during both the training process and the test/inference process, we use the LR images as inputs during the training process, but the HR images acquired with $D_{\max}$ as inputs during the test/inference process to surpass the diffraction limit of the optical imaging system. As the resolution enhancement works like an extrapolation operation, the test/inference process is termed the super-resolution extrapolation process in this paper, and the output of the dpcCARTs-Net is called the extrapolated SR output.

However, the HR images cannot be directly input to the well-trained dpcCARTs-Net in our work. The EAMS only changes the NA of the imaging system, keeping its FOV and magnification the same. As a result, the LR and the HR images, acquired using the same imaging detector, occupy the same pixels, and the sampling rate discrepancy between the LR and HR images must first be addressed. An image magnification operation is proposed, and the whole super-resolution extrapolation process proceeds as follows: an h × w pixel HR image is first magnified m times to match the sampling rate, and then the magnified (m × h) × (m × w) pixel image is passed to the well-trained dpcCARTs-net to rapidly infer an (m × h) × (m × w) pixel image that surpasses the diffraction limit of the imaging system, as shown in Fig. 3. Here m corresponds to the ratio of the resolutions of the LR and HR images. In this paper, m = 3 is used for comparing the SR performance of the dpcCARTs-net trained with the 2-aperture and 3-aperture modulation strategies, as both have the same extrapolated SR output $\hat{y}_1$. When m = 3, the extrapolated SR output $\hat{y}_2$ of the 3-aperture modulation strategy is incorrect and is ignored in this paper. To obtain the correct extrapolated SR output $\hat{y}_2$, m = 2 should be used.

Fig. 3. The schematic diagram of the proposed super-resolution extrapolation process.

Note that the image magnification operation only increases the sampling rate of the HR image and has little effect on its resolution. A bi-cubic interpolation method combined with a ReLU operation is adopted in this paper to numerically magnify the HR images and ensure that all values are non-negative. Other advanced numerical interpolation methods may also work for this purpose, but this is beyond the scope of this paper. In practice, an optimal choice is to use a physical magnification operation with a combination of lenses or an objective lens having the same numerical aperture but m times the magnification.
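The numerical magnification step could be sketched as below, assuming SciPy's bi-cubic zoom for the interpolation and assuming the hypothetical model object from the earlier sketches is built to accept arbitrary input sizes, e.g., with Input(shape=(None, None, 1)), or is applied patch-wise.

import numpy as np
from scipy.ndimage import zoom

def sr_extrapolate(model, hr_image, m=3):
    """Magnify an HR image m times (bi-cubic, order=3), clip negatives (ReLU),
    then pass it to the well-trained network to obtain the extrapolated SR output."""
    magnified = zoom(hr_image, m, order=3)          # (h, w) -> (m*h, m*w)
    magnified = np.maximum(magnified, 0.0)          # ReLU: keep all values non-negative
    magnified /= magnified.max()                    # normalize to 1, as in pre-processing
    batch = magnified[np.newaxis, ..., np.newaxis]  # add batch and channel dimensions
    y1_hat, _ = model.predict(batch)                # keep the HR-branch output
    return y1_hat[0, ..., 0]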

Giving a clear explanation of the extrapolation from HR to SR is a tough task, as the interpretability of DLNs remains a big challenge in related research. In our understanding, the information learned by the DLN is the neural network's parameters $\Theta$, not the specific frequency-component discrepancy between the LR and HR images. The well-trained DLN can map an LR image to an HR one, which means it has a tendency to enlarge the frequency range of the input image. This tendency was termed the scale-invariance of the image transformation from lower-resolution images to higher-resolution ones by Rivenson et al. [13], and has been exploited by inputting the HR image to the DLN to obtain a better resolution performance. Besides, the output of the DLN is decided not only by $\Theta$, but also by the input to the DLN. In our work, the HR image is first magnified m times to match the sampling rate of the LR image. From the viewpoint of image processing, the magnified HR image will be recognized as an “LR” image by the DLN, and the neural network's parameters $\Theta$ will transform it into a more resolvable one. The following numerical simulations and experiments in this paper further verify this hypothesis.

3. Results

3.1 Numerical validation

First, the performance of the dpcCARTs-Net is investigated by numerical simulations. We follow the linear and space-invariant model for imaging, and the parameters of the imaging system for simulation are set the same as those given in Section 2.1. Randomly generated point sources are used as the objects to be imaged. These point sources are ideal points, each occupying only one pixel of the object plane, i.e., $0.24~\mathrm{\mu m}$, and they are convolved with the corresponding 10× magnified PSFs to “virtually” acquire images of different resolution under the aperture sizes $D_{LR} = 2.5~\textrm{mm}$, $D_{MR} = 5~\textrm{mm}$ and $D_{HR} = 7.5~\textrm{mm}$, respectively. As objects may be naturally distributed with varying sparsity in real cases, without loss of generality, the sparsity in our simulation follows a uniform distribution between 0.1% and 10%. Here the sparsity means the concentration of ideal point sources. Figure 4 gives example images of the sparsest and densest cases. The sparsity difference is up to 100 times, and the HR image of the densest case can hardly be distinguished into any structures, which poses a big challenge for training the network. The main purpose of the numerical simulations in this section is to serve as a prototype verification, so noise is not considered here. The influence of noise will be discussed later.
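A minimal sketch of this simulation step is given below; the Airy-pattern PSF model, its sampling on the object-plane grid with the 0.24 μm effective pixel, and the effective NA values in the commented example are illustrative assumptions consistent with the linear, space-invariant imaging model stated above, not values taken from the original code.

import numpy as np
from scipy.signal import fftconvolve
from scipy.special import j1

def airy_psf(size, na, wavelength, pixel_size):
    """Incoherent Airy-pattern PSF sampled on the object-plane grid (assumption)."""
    yy, xx = np.mgrid[:size, :size] - size // 2
    r = np.hypot(yy, xx) * pixel_size
    v = 2 * np.pi * na * r / wavelength
    v[v == 0] = 1e-12                           # avoid division by zero at the center
    return (2 * j1(v) / v) ** 2

def simulate_image(shape, sparsity, psf, rng):
    """Random ideal point sources at the given sparsity, convolved with the PSF."""
    obj = (rng.random(shape) < sparsity).astype(float)
    img = fftconvolve(obj, psf, mode='same')
    return obj, img / img.max()

# Example (illustrative): effective NA inferred from the 1.25 um limit and scaled with D.
# rng = np.random.default_rng(0)
# psf_hr = airy_psf(65, na=0.26, wavelength=0.53, pixel_size=0.24)          # D_HR = 7.5 mm
# psf_lr = airy_psf(65, na=0.26 * 2.5 / 7.5, wavelength=0.53, pixel_size=0.24)  # D_LR = 2.5 mm
# obj, img_lr = simulate_image((256, 256), 0.01, psf_lr, rng)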

Fig. 4. Example images of randomly generated point sources with (a) 0.1% sparsity ($D_{LR} = 2.5~\textrm{mm}$), (b) 0.1% sparsity ($D_{HR} = 7.5~\textrm{mm}$), (c) 10% sparsity ($D_{LR} = 2.5~\textrm{mm}$), and (d) 10% sparsity ($D_{HR} = 7.5~\textrm{mm}$) in simulation. All images are pseudo-color images and normalized to 1.

The initial 256 × 256 pixel image is cut into 64 × 64 pixel sub-images with no overlapping, and each sub-image is normalized separately. With many random realizations, we finally generate 32000 groups of training data, 3200 groups of validation data and 3200 groups of test data, respectively. Then, the training data and validation data are fed into the dpcCARTs-net for training. The test data are used to further adjust the training parameters, and the regularization parameters of the loss function are finally chosen empirically as ${\lambda _1} = 0.005$ and ${\lambda _2} = 0.00001$, considering the sparse characteristics of point-source objects. The other training parameters are set to the default values of the deep learning platform. The full training process takes ∼12 hours and ∼14.5 hours for the 2-aperture and 3-aperture modulation strategies, respectively.
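The patching step could look like the sketch below, assuming per-patch normalization by the maximum value; the paper only states that sub-images are normalized to 1, so the exact normalization is an assumption.

import numpy as np

def to_patches(image, patch=64):
    """Cut a 256 x 256 image into non-overlapping 64 x 64 sub-images,
    normalizing each sub-image separately to a maximum of 1."""
    h, w = image.shape
    patches = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            p = image[i:i + patch, j:j + patch].astype(np.float32)
            patches.append(p / p.max() if p.max() > 0 else p)
    return np.stack(patches)[..., np.newaxis]   # shape (16, 64, 64, 1) for a 256 x 256 input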

Two-point pairs with decreasing spacing are used to calibrate the resolution enhancement ability. Figure 5(a) is the interpolated HR image, which corresponds to the diffraction-limited imaging result of the imaging system. In Fig. 5(a), only the top two-point pair can be just resolved according to the Rayleigh criterion. As the spacing of the two-point pairs decreases step by step, the required resolution enhancement ability increases over $S = 1:0.1:2$ (left column, from top to bottom) and $S = 2.1:0.1:3$ (right column, from top to bottom) times the diffraction limit to resolve the corresponding two-point pair. Figure 5(a) is input to the well-trained dpcCARTs-Net, and the extrapolated SR outputs of the dpcCARTs-Net trained with the 2-aperture and 3-aperture modulation strategies are shown in Fig. 5(c) and Fig. 5(d), respectively. The interpolated 3 times theoretically super-resolved image of these two-point source pairs is given in Fig. 5(b) for comparison and reference. It can be clearly seen that the HR input is super-resolved in both cases, and the SR performance of the dpcCARTs-Net trained with the 3-aperture modulation strategy is better than that of the dpcCARTs-Net trained with the 2-aperture modulation strategy. For a quantitative analysis of the SR performance, the full width at half maximum (FWHM) of the just-resolved single points in Fig. 5(c) and Fig. 5(d) is first calculated, giving $0.56~\mathrm{\mu m}$ and $0.47~\mathrm{\mu m}$, respectively. The corresponding resolutions are ∼2.23 times and ∼2.66 times finer than the diffraction-limited resolution of $1.25~\mathrm{\mu m}$. Figure 5(e) and Fig. 5(f) give the cross-sectional profiles of the two-point pairs corresponding to resolution enhancement abilities of 2.2 times and 2.7 times, better illustrating the stronger resolving ability of the dpcCARTs-Net trained with the 3-aperture modulation strategy. Then, two commonly used metrics, i.e., MSE and SSIM, are computed to evaluate the image fidelity of the extrapolated SR images. Using the interpolated 3 times theoretically super-resolved image in Fig. 5(b) as a reference, the MSE values of the images in Fig. 5(a), 5(c) and 5(d) are 0.016, $5.82 \times 10^{-4}$ and $2.75 \times 10^{-4}$, respectively, and the SSIM values of the corresponding images are 0.494, 0.962 and 0.985, respectively. The extrapolated SR output of the dpcCARTs-Net shows a significant improvement in image fidelity compared to the HR input, and the dpcCARTs-Net trained with the 3-aperture modulation strategy performs better than that trained with the 2-aperture modulation strategy. All these results indicate that both the resolution enhancement ability and the image fidelity of the dpcCARTs-Net can be improved by adding just one set of label data.
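The quantitative metrics could be reproduced along the lines of the sketch below. The FWHM routine, its linear interpolation across the half-maximum crossings, and the assumed sampling of 0.24/3 μm on the m = 3 magnified grid are illustrative choices; SSIM and MSE are computed with scikit-image here, which is an assumption rather than a statement of the original tooling.

import numpy as np
from skimage.metrics import mean_squared_error, structural_similarity

def fwhm(profile, pixel_size=0.24 / 3):
    """FWHM of a 1-D cross-sectional profile (in microns), assuming a single peak
    that falls below half maximum on both sides, sampled on the m = 3 magnified grid."""
    p = np.asarray(profile, dtype=float)
    half = p.max() / 2.0
    above = np.where(p >= half)[0]
    left, right = above[0], above[-1]
    # Linear interpolation at the two half-maximum crossings.
    l = left - (p[left] - half) / (p[left] - p[left - 1])
    r = right + (p[right] - half) / (p[right] - p[right + 1])
    return (r - l) * pixel_size

# mse = mean_squared_error(reference, sr_output)
# ssim = structural_similarity(reference, sr_output, data_range=1.0)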

Fig. 5. The results of the well-trained dpcCARTs-Net on the two-point source pairs. The interpolated results of (a) the diffraction-limited HR input, (b) the 3 times theoretically super-resolved image. The extrapolated SR output of the dpcCARTs-net trained with (c) 2-aperture modulation strategy and (d) 3-aperture modulation strategy. The cross-sectional profiles of the resolution enhancement of (e) 2.2 times and (f) 2.7 times. Here 0.735 is the normalized intensity of the saddle-point according to Rayleigh criterion. (a)-(d) are pseudo-color images and normalized to 1. These two-point source pairs are equally arranged in longitudinal direction with a separation distance of 4.8 μm.

Next, the influence of point-source sparsity on the SR extrapolation process is investigated. The HR images of random examples with 10%, 1% and 0.1% sparsity are chosen as the inputs, as shown in Fig. 6(a)–6(c). The quantitative resolution for Fig. 6(a)–6(c) is evaluated by calculating the FWHM of a resolvable single point, which is $\sim 0.84~\mathrm{\mu m}$, $\sim 0.5~\mathrm{\mu m}$ and $\sim 0.46~\mathrm{\mu m}$, respectively, and the corresponding theoretically super-resolved images are given as references. The zoom-in regions are given in Fig. 6(d)–(o) for a clear comparison. In general, the sparser the HR input is, the better the SR performance will be, and the dpcCARTs-Net trained with the 3-aperture modulation strategy demonstrates a better SR performance than the one trained with the 2-aperture modulation strategy in all sparsity cases. The dpcCARTs-net works well when the point-source sparsity is 0.1% and 1%, and the artifacts are not apparent and can be considered acceptable. But the artifacts become noticeable when the sparsity is 10%, because the dpcCARTs-net used here is trained with the point-source sparsity uniformly distributed between 0.1% and 10% under a sparse hypothesis, and ${\lambda _2} = 0.00001$ may not be suitable in this case. A better SR performance may be achieved by pretraining and optimizing a set of DLNs for various point-source sparsities. Then, in the reconstruction stage, an optimal DLN can be applied to the input HR image by estimating its point-source sparsity, as suggested by Nehme et al. [17].

Fig. 6. The influence of point-source sparsity on the SR extrapolation process of the dpcCARTs-net. An example diffraction-limited HR input of (a) 10% sparsity, (b) 1% sparsity and (c) 0.1% sparsity. (d, e, f) are the corresponding zoom-in area of (a, b, c). The corresponding zoom-in area of the extrapolated SR of the dpcCARTs-net trained with (g, h, i) 2-aperture modulation strategy and (j, k, l) 3-aperture modulation strategy. The referenced (m) 1.5 times, (n) 2.5 times and (o) 2.7 times theoretically super-resolved image of (d, e, f). All images are pseudo-color images and normalized to 1.

For some applications, the point-source distributions may be controlled at a certain sparsity level. To address this issue, the influence of point-source sparsity on the training process is further studied in detail. The dpcCARTs-net is trained separately using randomly generated point sources with a fixed sparsity. Using the aforementioned two-point source pairs in Fig. 5, the resolution enhancement ability of the dpcCARTs-Net trained with the 2-aperture and the 3-aperture modulation strategies is calculated and compared in Fig. 7. We find that the resolution enhancement ability of the dpcCARTs-Net generally decreases from 3 to 1.5 times the diffraction limit as the point-source sparsity varies from 0.1% to 10%. However, the SR ability of the dpcCARTs-net trained with the 3-aperture modulation strategy is always better than that of the 2-aperture modulation strategy when the sparsity is higher than 1%, i.e., when the point-source distribution becomes denser.

Fig. 7. The influence of the point-source sparsity on the training of the dpcCARTs-net.

The combination of the results in Fig. 5, Fig. 6 and Fig. 7 indicates that a DLN trained to map lower-resolution images to higher-resolution ones follows a complicated self-learning process. In addition to the features related to resolution enhancement, mappings of other features (contrast, structural characteristics, etc.) can also be learned, and the output of the DLN is the combination of the predictions of all the features. For sparser objects, the simple and symmetric structure may promote the learning of features related to resolution enhancement, while for objects with complex structure, the attention of the DLN will be distracted to learn other features. Consequently, the resolution enhancement ability learned by the dpcCARTs-Net trained using denser data is relatively weaker than that learned using sparser data, but a resolution enhancement approaching 1.5 times can still be learned by the dpcCARTs-Net trained with the 3-aperture modulation strategy when the point-source sparsity is 10%, as shown in Fig. 7. Correspondingly, all the features of the HR input will be considered by a trained dpcCARTs-net during the SR extrapolation process. When the HR input itself is denser, the prediction of other features will become more noticeable, which will weaken the resolution enhancement and result in a worse SR performance, as shown in Fig. 6(j). This is an important issue for further research. Given the better SR performance of the dpcCARTs-Net trained with the 3-aperture modulation strategy, DLN architectures with more inputs or labels can thus be suggested to increase the robustness and reduce the artifacts of the outputs of DLN-based SR methods.

Note that the training parameters used in this section are the default values of the deep learning platform. A better SR performance may be achieved by careful hyperparameter optimization, such as giving different weights to the loss functions of the moderate-resolution label and the HR label of the 3-aperture modulation strategy. Besides, the regularization parameters of the loss function are set the same for all point-source sparsities; the extrapolated SR output of the dpcCARTs-Net trained with denser point-source distributions may therefore be improved by using optimized regularization parameters.

3.2 Experimental results

To train the dpcCARTs-net with real biological specimens, 10 different thin longitudinal sections of maize seed are used as the imaging objects. The “effective” areas of the experimentally acquired images are further augmented by rotating three times and extracting $64 \times 64$ pixel sub-images with a randomly selected overlap between 35% and 45%. We finally obtain 32000 groups of image patches, 99% of which are randomly selected to train and validate the dpcCARTs-net, i.e., 90% for training and 9% for validating the network model during the training process. A shuffle operation is taken during each epoch to avoid overfitting or underfitting of the final model. The remaining 1% form the test images, and the regularization parameters are finally optimized empirically as ${\lambda _1} = 0.001$ and ${\lambda _2} = 0.001$, considering the complexity of the biological specimens.
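The augmentation described above could be sketched as follows; treating the 35%–45% overlap as a per-step random stride and normalizing each patch to its maximum are assumptions about the exact implementation.

import numpy as np

def augment_and_patch(image, patch=64, seed=0):
    """Rotate the image three times (90, 180, 270 degrees) and extract 64 x 64
    sub-images with a randomly selected overlap between 35% and 45%."""
    rng = np.random.default_rng(seed)
    patches = []
    for k in range(4):                                  # original + 3 rotations
        rotated = np.rot90(image, k)
        i = 0
        while i + patch <= rotated.shape[0]:
            j = 0
            while j + patch <= rotated.shape[1]:
                p = rotated[i:i + patch, j:j + patch].astype(np.float32)
                patches.append(p / p.max() if p.max() > 0 else p)
                j += int(patch * (1 - rng.uniform(0.35, 0.45)))  # random horizontal overlap
            i += int(patch * (1 - rng.uniform(0.35, 0.45)))      # random vertical overlap
    return np.stack(patches)[..., np.newaxis]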

After the dpcCARTs-net is well trained, a completely new longitudinal section of maize seed is used to test the SR performance of the dpcCARTs-net on biological specimens. Figure 8(a) shows one example of the extrapolated SR output of the dpcCARTs-net trained with the 3-aperture modulation strategy. For convenient comparison, the zoom-in regions of interest (ROIs) of the diffraction-limited HR input, the extrapolated SR output of the dpcCARTs-net trained with the 2-aperture and 3-aperture modulation strategies, and the 1.5 times experimentally super-resolved images used as references, are given in Figs. 8(b)–8(m). The 1.5 times experimentally super-resolved images are acquired by modulating the aperture size of the MID to 11.25 mm. Compared to the diffraction-limited HR input, the extrapolated SR output of the dpcCARTs-net presents an evident improvement both in resolution and contrast, and the SR performance of the dpcCARTs-net trained with the 3-aperture modulation strategy is again better than that of the 2-aperture modulation strategy. More specifically, Figs. 8(n) and 8(o) give the cross-sectional profiles of two white and two black feature spots, which are marked with the boxes in Fig. 8(h), better illustrating the finer spatial resolution improvement provided by the dpcCARTs-net trained with the 3-aperture modulation strategy. The cross-sectional profiles also indicate that the peak-to-peak distances of the two feature spots in Fig. 8(n) and Fig. 8(o) are $0.9~\mathrm{\mu m}$ and $1~\mathrm{\mu m}$, respectively, both smaller than the diffraction-limited resolution of $1.25~\mathrm{\mu m}$. Compared to the 1.5 times experimentally super-resolved images in Figs. 8(k)–8(m), the artifacts in the extrapolated SR output of the dpcCARTs-net are not apparent, which is consistent with the output of other DLN-based SR methods [13,14,17]. The smoothness of the extrapolated SR output mainly comes from the pre-filtering operation and the bi-cubic interpolation magnification, whose influence on the biological structure of the specimen is acceptable in our experiment. The yellow arrows in Figs. 8(i) and 8(j) also point out some much clearer gaps and shapes of the biological structures. These results may benefit further study of the biological information of the specimens. For example, the ROI marked with the red box in Fig. 8 displays the endosperm region, which is the main place for the accumulation and storage of nutrients in maize seed, and the black spots are the starch grains. As the major storage material, starch accumulation has a great impact on the quality and yield of maize [22]. The extrapolated SR output of the dpcCARTs-net trained with the 3-aperture modulation strategy provides a much clearer edge of the starch grains, which will help improve the evaluation accuracy of starch content.

Fig. 8. The results of the dpcCARTs-net on a new longitudinal section of maize seed. (a) An example extrapolated SR output image of the dpcCARTs-net trained with the 3-aperture modulation strategy. Comparison of zoom-in ROIs: (b, c, d) the diffraction-limited HR input. The extrapolated SR output of the dpcCARTs-net trained with (e, f, g) the 2-aperture modulation strategy and (h, i, j) the 3-aperture modulation strategy. (k, l, m) The 1.5 times experimentally super-resolved images used as references. The cross-sectional profiles of the two (n) white and (o) black feature spots, within the boxes marked in (h). The yellow arrows in (i, j) point to some blurred gaps and shapes that are made clearer by the well-trained dpcCARTs-net. All images are normalized to 1.

4. Discussion

The numerical simulations in Section 3.1 use a noiseless imaging model. To investigate the influence of noise on the SR ability of the dpcCARTs-Net, white Gaussian noise is added to images of different resolution, which are acquired in the same way as those used for training the dpcCARTs-Net in Fig. 5. The signal-to-noise ratio (SNR) is defined as the ratio of the signal power to the noise power. In this paper, we use the SNR of the HR images to represent the noise level. As the signal power of an image decreases with decreasing aperture size while the noise power remains the same, images with lower resolution thus have lower SNR. All the images are processed using the pre-filtering denoising operation. The SR abilities of the dpcCARTs-Net trained with the 2-aperture and the 3-aperture modulation strategies are calculated and plotted in Fig. 9(a). The results demonstrate that the influence of noise on the SR ability of the dpcCARTs-Net decreases as the SNR increases. When the SNR is greater than 50, the SR ability of the dpcCARTs-Net is very close to that in Fig. 5, so the influence of noise can be ignored. Figure 9(b) and Fig. 9(c) are the extrapolated SR outputs of the dpcCARTs-Net trained with the 2-aperture and 3-aperture modulation strategies when SNR = 10. The images of the point-source pairs in Fig. 9(b) are surrounded by apparent artifacts, while those in Fig. 9(c) are very clean, which indicates that the dpcCARTs-Net trained with the 3-aperture modulation strategy is more robust to distortions, such as noise. These results further support the main conclusions derived from the numerical simulations in Section 3.1.
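The noise injection could be implemented as sketched below, with the SNR defined as the ratio of signal power to noise power as above. Fixing the noise standard deviation from the HR image and reusing the same noise power for the lower-resolution images, as stated in the text, is left to the caller.

import numpy as np

def add_gaussian_noise(image, snr, seed=0):
    """Add white Gaussian noise so that signal power / noise power equals snr."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(image ** 2)
    noise_power = signal_power / snr
    noisy = image + rng.normal(0.0, np.sqrt(noise_power), image.shape)
    return np.clip(noisy, 0.0, None)   # keep intensities non-negative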

Fig. 9. The influence of noise on the SR ability of the dpcCARTs-net. (a) The relationship between the SR ability and the SNR. The extrapolated SR output of the dpcCARTs-Net trained with (b) the 2-aperture and (c) the 3-aperture modulation strategy when SNR = 10. The dpcCARTs-nets are trained with the point-source sparsity uniformly distributed between 0.1% and 10%.

Compared with the commonly used 2-aperture modulation strategy in previous methods [13–16], the better performance of the method demonstrated in this paper, i.e., the 3-aperture modulation strategy together with the related dpcCARTs-net, can be attributed to the reasons listed below. (1) By adding one set of label data, we use images of different resolution at the corresponding level as supervision, which guides the DLN training to progressively learn sub-band residuals of different frequency components, similar to [23]. As a result, the 3-aperture modulation strategy has a greater power to predict complicated mappings and effectively reduces undesired artifacts. (2) The channel attention mechanism introduced in the CART blocks helps adaptively learn the informative high-frequency features across a wide range, which further improves the resolution and the image fidelity simultaneously.

The EAMS demonstrated in this paper plays a key role in both fast and consecutive aperture modulation. Though the generalization of DLN-based SR methods has been demonstrated, e.g., a DLN trained on one type of specimen may be applied to other types [13], using the same type of specimen for training and prediction is always recommended for optimal results. Compared to the previous methods in [13–16], the fast aperture modulation ability of the EAMS facilitates a rapid acquisition of images with different resolution, which enables new opportunities for training data collection and further SR imaging of moving targets. The opening velocity of the MID used in our experiment is about 6 mm/s, and it takes ∼1 s to acquire a group of 3 images with different resolution. To meet the imaging requirements of moving objects, a micro-electro-fluidic technology [24] based variable aperture with fast response, i.e., 80 mm/s, may be used in the future. The consecutive modulation of the EAMS makes multi-aperture modulation strategies and related DLN architectures easy to achieve, which can help further improve the SR performance. Examining the benefits of additional inputs and labels is a planned direction of further study, but the SR performance improvement is not a straightforward extension. It is important to note that the attention of the DLN will be distracted when more inputs or labels are involved; a sophisticated hyperparameter optimization and a deeper network with careful design are encouraged.

To practice our method on different optical imaging systems or other types of specimens, a fresh implementation of the whole framework and a sophisticated hyperparameter optimization are suggested for better SR performance. Two issues should be addressed when building the EAMS. First, the aberrations introduced by the two lenses need to be reduced as much as possible. Second, the variable iris needs to be well tuned, with its center and position matching those of the pupil plane; otherwise the aperture modulation process will not work accurately and the FOV of the original imaging system will be affected. If the related DLN needs only one input, such as the dpcCARTs-net used in this paper, the imaging camera and the resolution can be the same as those of the original optical imaging system. Once the DLN is well trained, the EAMS can then be removed. The well-trained DLN can be integrated as an application program interface and called by the camera to fulfil “quasi real-time” imaging, thanks to the intrinsically fast inference of DLN-based methods. Combined with advanced image segmentation algorithms [25], the SR imaging and display speed of ROIs can be further increased. In addition, “self-feeding” the output of the DLN as its new input can be further used to improve the resolution of the image, but it is only effective in the first few cycles [13].

The training strategy can also be optimized. For example, Fig. 10(a) gives an image of point sources with 0.75% sparsity, which is separately input to the dpcCARTs-Net trained with point sources of 0.75% and 0.25% sparsity, respectively. The corresponding extrapolated SR outputs are shown in Fig. 10(b) and Fig. 10(c). Interestingly, the extrapolated SR output of the dpcCARTs-Net trained using point sources with 0.25% sparsity demonstrates a much higher performance both in resolution enhancement ability and image fidelity, as clearly shown in the zoom-in regions in Fig. 10(d)–10(f). This might occur because the inputs to the SR extrapolation process are the HR images, which will be recognized by the dpcCARTs-Net as an m = 3 times sparser object than the training inputs, i.e., the LR images. That is, for a given point-source sparsity, a better strategy is to train the DLN using a relatively sparser point-source distribution. This empirical finding may be of practical significance for applications such as DLN-based STORM imaging. Instead of using the same emitter density for training and inference [17,18], a DLN trained using a relatively sparser emitter density can be used to infer the image of a denser one, which can be used to further improve or balance the spatial and temporal resolution.

Fig. 10. The SR performance comparison of the dpcCARTs-net trained using different point-source sparsity. (a) The diffraction-limited HR input with 0.75% sparsity. The extrapolated SR output of dpcCARTs-Net trained using randomly generated point sources with (b) 0.75% sparsity and (c) 0.25% sparsity. (d, e, f) are the corresponding zoom-in area of (a, b, c). All images are pseudo-color images and normalized to 1.

5. Conclusion

In this paper, the feasibility of using an external aperture modulation subsystem and a related deep learning method to surpass the diffraction limit of a conventional imaging system is demonstrated numerically and experimentally. Compared to the commonly used 2-aperture modulation strategy, a higher performance both in resolution enhancement ability and image fidelity is achieved. The method proposed here provides more operability in the realization of various image acquisition strategies and related DLN architectures, which may serve as a more general way to further improve the SR performance of DLN-based approaches. In particular, this framework can be implemented as an add-on module to existing and future imaging systems with a compact setup and low cost. It also offers a flexible and practical solution for training data collection and further SR imaging of label-free moving objects, such as living cells. These results further support the idea of using DLN-based methods, as an alternative or a better solution, for a wide variety of inverse problems in computational imaging.

Funding

National Natural Science Foundation of China (11903062, 11773045, 11933005); CAS Pioneer Hundred Talents Program.

Acknowledgments

We appreciate the reviewers for their valuable comments and suggestions.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. J. W. Goodman, Introduction to Fourier Optics (Roberts & Company Publishers, 2005).

2. S. W. Hell and J. Wichmann, “Breaking the diffraction resolution limit by stimulated emission: stimulated-emission-depletion fluorescence microscopy,” Opt. Lett. 19(11), 780–782 (1994).

3. B. Hecht, B. Sick, U. P. Wild, V. Deckert, and D. W. Pohl, “Scanning near-field optical microscopy with aperture probes: fundamentals and applications,” J. Chem. Phys. 112(18), 7761–7774 (2000).

4. E. Betzig, G. Patterson, R. Sougrat, O. W. Lindwasser, S. Olenych, J. Bonifacino, M. Davidson, J. Lippincott-Schwartz, and H. Hess, “Imaging intracellular fluorescent proteins at nanometer resolution,” Science 313(5793), 1642–1645 (2006).

5. B. Huang, W. Wang, M. Bates, and X. Zhuang, “Three-dimensional super-resolution imaging by stochastic optical reconstruction microscopy,” Science 319(5864), 810–813 (2008).

6. M. G. L. Gustafsson, “Surpassing the lateral resolution limit by a factor of two using structured illumination microscopy,” J. Microsc. 198(2), 82–87 (2000).

7. K. de Haan, Y. Rivenson, Y. Wu, and A. Ozcan, “Deep-learning-based image reconstruction and enhancement in optical microscopy,” Proc. IEEE 108(1), 30–50 (2020).

8. G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica 6(8), 921–943 (2019).

9. H. B. Yedder, B. Cardoen, and G. Hamarneh, “Deep learning for biomedical image reconstruction: a survey,” Artif. Intell. Rev. 54(1), 215–251 (2021).

10. M. Hirsch, S. Harmeling, S. Sra, and B. Scholkopf, “Online multi-frame blind deconvolution with super-resolution and saturation correction,” Astron. Astrophys. 531, A9 (2011).

11. F. M. N. Mboula, J. L. Starck, S. Ronayette, K. Okumura, and J. Amiaux, “Super-resolution method using sparse regularization for point-spread function recovery,” Astron. Astrophys. 575, A86 (2015).

12. B. Xu, Z. Wang, and J. He, “Beating the Rayleigh limit via aperture modulation,” J. Opt. 23(1), 015701 (2021).

13. Y. Rivenson, Z. Gorocs, H. Gunaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica 4(11), 1437–1443 (2017).

14. H. Wang, Y. Rivenson, Y. Jin, Z. Wei, R. Gao, H. Günaydın, L. A. Bentolila, C. Kural, and A. Ozcan, “Deep learning enables cross-modality super-resolution in fluorescence microscopy,” Nat. Methods 16(1), 103–110 (2019).

15. H. Zhang, C. Fang, X. Xie, Y. Yang, W. Mei, D. Jin, and P. Fei, “High-throughput, high-resolution deep learning microscopy based on registration-free generative adversarial network,” Biomed. Opt. Express 10(3), 1044–1063 (2019).

16. T. Liu, K. De Haan, Y. Rivenson, Z. Wei, X. Zeng, Y. Zhang, and A. Ozcan, “Deep learning-based super-resolution in coherent imaging systems,” Sci. Rep. 9(1), 3926 (2019).

17. E. Nehme, L. Weiss, T. Michaeli, and Y. Shechtman, “Deep-STORM: super-resolution single-molecule microscopy by deep learning,” Optica 5(4), 458–464 (2018).

18. E. Nehme, D. Freedman, R. Gordon, B. Ferdman, L. E. Weiss, O. Alalouf, T. Naor, R. Orange, T. Michaeli, and Y. Shechtman, “DeepSTORM3D: dense 3D localization microscopy and PSF design by deep learning,” Nat. Methods 17(7), 734–740 (2020).

19. B. Xu, Z. Wang, and J. He, “Super-resolution imaging via aperture modulation and intensity extrapolation,” Sci. Rep. 8(1), 15216 (2018).

20. Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision (IEEE, 2018), pp. 286–301.

21. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process 13(4), 600–612 (2004).

22. T. M. Baye, T. C. Pearson, and A. M. Settles, “Development of a calibration to predict maize seed composition using single kernel near infrared spectroscopy,” J. Cereal Sci. 43(2), 236–243 (2006).

23. W. S. Lai, J. B. Huang, N. Ahuja, and M. H. Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 5835–5843.

24. J. Chang, K. Jung, E. Lee, M. Choi, S. Lee, and W. Kim, “Variable aperture controlled by microelectrofluidic iris,” Opt. Lett. 38(15), 2919–2922 (2013).

25. S. De, S. Bhattacharyya, S. Chakraborty, and P. Dutta, Hybrid Soft Computing for Multilevel Image and Data Segmentation (Springer, 2016), Chap. 1.
