Autofocus methods based on laser illumination

Zhijie Hua; Xu Zhang; Xu Zhang; Dawei Tu

doi:10.1364/OE.499655

1. Introduction

Microscopes are commonly used in industrial production to detect and identify defects such as surface defects and cracks. Due to their small size, defects require observation at high magnification and can only be clearly seen at the ideal focal plane. However, manual adjustment to find the ideal focal plane is inefficient and has low accuracy. To overcome this challenge, autofocus systems can automatically adjust the focal plane within a short time, ensuring high-quality images at high magnification. Additionally, autofocus systems can quickly detect and identify defects and their locations in the sample, improving detection accuracy and consistency. Therefore, autofocus systems play a crucial role in industrial microscopic detection.

A variety of autofocus strategies in the industry have been developed, which can be divided into two categories: range-based autofocus and natural-image-based autofocus. Range-based autofocus utilize dedicated hardware components, such as laser [1], ultrasonic rangefinders [2] or special cameras [3,4], to automatically adjust the focus of the objective lens. Instead of relying on external sensors, natural-image-based autofocus uses algorithms to analyze image sharpness [5–12] and iteratively optimize focus position [13–15]. Nowadays, learning-based methods have emerged as a powerful approach, which assumes an optical defocus model. Such methods build network models to learn and predict the optimal defocus distance based on previously captured images or data. In general, the network models can be assigned to classification models and regression models. Classification models input a focal stack [16], multiple slices [17], or a single slice [18–20] and then the defocus distance is obtained from the predicted category label with the highest score. Regression models omit the intermediate step compared to the classification models, and directly establish a connection between the image and the defocus distance to achieve end-to-end prediction. The input is one or two slices [21–24] or the magnitude of its Fourier transform [25,26], and the output is the estimated defocus distance. Some challenges of such methods include the need for large and diverse datasets, the need to balance speed and accuracy trade-offs in real-time imaging applications, and the small measurable off-focus amount.

In the above methods, range-based autofocus provides high focusing accuracy, but requires external range measurement devices, which can be costly and complexity. Natural-image-based autofocus is often affected when in complex backgrounds, such as defects [27] and noises [28] on the surface of the object being detected, leading to focus drift and blurring in microscopic images. Therefore, active laser illumination has emerged as a new strategy [29,30] to achieve high-precision autofocus, with the main challenge being the extraction of laser spot features. The laser spot is projected onto the surface of various objects, and the resulting spot pattern can be affected by factors such as object texture, material, defects, and height differences, as shown in Fig. 1. When the spot pattern is out of focus, the complex background overlaps significantly with the split-image pattern, seriously affecting the accuracy of the pattern extraction and leading to large deviations in the final focusing accuracy. Under this direction, LSPM [31] has been proposed. This method restores the split image in a complex background into a split image in a clean background, and then extracts the feature of the spot pattern and predicts the defocus distance. However, such method requires step-by-step to achieve focusing, and the calibrated model itself has errors, which will hinder the further improvement of focusing speed and accuracy. Therefore, developing more powerful autofocus algorithms with high focusing accuracy, speed, and robustness are necessary.

Fig. 1. Different spot patterns under different detection backgrounds.

Download Full Size | PDF

In this paper, a high-precision autofocus method with laser illumination was proposed, termed Laser Split-image Autofocus (LSA). This method uses the laser beam to project the pattern engraved on the glass onto the surface of the detected object, and uses the triangular prism to divide the spot pattern into two portions to better describe the spot pattern features. The common non-learning-based and learning-based methods for LSA were quantitatively analyzed and evaluated. Furthermore, a lightweight comparative framework model for LSA, termed split-image comparison model (SCM), was proposed to more accurately predict the defocus distance from a single defocus image, and a realistic dataset of sufficient size was made to train the model. This model uses a lightweight neural network to extract the spot pattern features of the upper and lower portions separately, and then compares the two features to predict the defocus distance. The experiment demonstrated that SCM has a great improvement compared with previous learning and non-learning methods. It was also showed that LSA has better focusing performance than the natural-image-based method. Note that LSA works with split images instead of natural images, and thus it can be applied to many other fields.

2. Principle of the split-image autofocusing

To better describe the features of the spot pattern projected by the laser beam, two triangular prisms are installed to divide the spot pattern into two portions. By comparing and analyzing the features of two spot patterns, high-precision split-image autofocus is achieved.

The split-image autofocusing [32] relies on the split-image system, which consists of a lens, a pair of triangular prisms called optical wedges, and a pattern mask. The structure of the optical wedges are shown in Fig. 2. The optical wedges have wedge surfaces that interlace to divide the spot pattern into two parts. When the ray bundles pass through the lens, they are reflected into the optical wedges and refracted on the upper and lower surfaces, forming an aerial image. Depending on where the aerial images are created in relation to the intersection of the optical wedges, the system can be categorized into far-focus, in-focus, and near-focus.

Fig. 2. Schematic illustration of the split-image system principle in three situations. (a), (b), and (c) create aerial images at $A, B, C$ positions and virtual images at $A_r$, $A_l$, $C_r$, $C_l$ positions. The red-marked optical wedge is closer to us, and the blue one is farther from us.

Download Full Size | PDF

When in far-focus, as shown in Fig. 2(a), the ray bundles emitted from $A$ enter the front surface of the red optical wedge and undergo refraction from a less dense to a denser medium within the wedge. The refracted bundles then exit the lower surface of the wedge and undergo another refraction from a denser to a less dense medium. Both refractions follow the laws of Fresnel, and the outgoing bundles form a virtual image at $A_l$. The image point $A$ undergoes a similar process when passing through the blue optical wedge, forming a virtual image at $A_r$. These imply that the same image point produces two split images after passing through the optical wedges, and the splitting directions of far-focus and near-focus are exactly opposite. When the split-image system is properly focused, as shown in Fig. 2(b), the ray bundles cluster to a focus, creating an aerial image at the intersection of the optical wedges and making the split-image patterns symmetrical.

Therefore, we can model the relationship between the split images and their corresponding actual defocus distance to quickly bring the focus to the appropriate point. After the relationship between them is established, the defocus distance can be predicted by a single defocus image. The equation is defined as:

(1)$$\Delta z = \Theta\left(I_{k}\right), \forall k \in\{1, \ldots, n\}$$

where, $\Delta z$ is the defocus distance, $\Theta (*)$ is the relationship model between the split image and the defocus distance, $I_{k}$ is a single defocus image.

3. Dataset

Different types of chip surfaces were used as the experimental object and observed under a 10X objective lens. The numerical aperture of the 10X objective lens was 0.3, the depth of field was 3.5$\mathrm {\mu }$m, and the work distance was 8.5mm. The Daheng industrial camera of 5 million pixels, called MER2-503-23GC, was used as an image acquisition device. A laser with a wavelength of 532nm was utilized to project the split-image pattern. The image was taken when the laser was turned on, and the bright field light source was turned off.

240 focal stacks under different detection scenes are captured by the motorized industrial microscope. Each focal stack contains 161 slices from −40$\mathrm {\mu }$m to 40$\mathrm {\mu }$m with a step size of 0.5$\mathrm {\mu }$m. The negative sign represents near-focus, and the positive one is far-focus. The z-axis with 50nm motion resolution can be moved in a step size of 0.5$\mathrm {\mu }$m and continuously triggers the camera to collect images. Each raw image is composed of 2448 $\times$ 2048 pixels in 24-bit. To facilitate the training of data, the region of interest is segmented from the raw image where the split-image pattern is located, and thus each slice’s size is 800 $\times$ 800.

The making process of each focal stack is shown in Fig. 3. The algorithm shown in 4.1 was used to calculate the pixel distance between the upper and lower split-image patterns in each slice. When the pixel distance is the smallest, the slice at this time is considered to be the focus position, and the actual defocus distance is marked as 0$\mathrm {\mu }$m. The equation can be defined as follows:

(2)$$\underset{j}{\arg\min } \Delta \text{ pixel }_{j} \mapsto O_{j}, \forall j \in\{0, \ldots, c\}$$

where $\Delta \text {pixel}_{j}$ is the pixel distance, $O_{j}$ denotes the focused image.

Fig. 3. The making process of each focal stack. (a) Example images of single focal stack. (b) The subpixel distance curve of slices in single focal stack. (c) Industrial microscope. (d) shows 7 slices at the same defocus distance from 3 focal stacks under different detection backgrounds and their corresponding natural images.

Download Full Size | PDF

Then, 80 slices are taken around the focus position, and their actual defocus distances are marked according to the step size.

After making 240 focal stacks, all data are partitioned randomly into training (80%, 192 focal stacks), validation(10%, 24 focal stacks), and testing(10%, 24 focal stacks) sets.

4. Autofocusing methods of split image

4.1 Non-learning-based method

Non-learning-based method uses traditional image processing algorithms to achieve autofocus. According to the above principle of split-image focusing, the upper and lower split-image patterns are symmetrical along the horizontal direction when the split-image system is properly focused. Otherwise, the upper and lower patch patterns are shifted left or right, making them no longer horizontally aligned. Therefore, the relationship between the pixel distance of the centroids of the upper and lower split-image patterns and the actual defocus distance can be modeled to quickly bring the focus to the appropriate point [31]. Non-learning-based method requires two steps to achieve autofocus from a single defocused image: 1) the calculation of pixel distance (similarity model) and 2) the mapping of pixel distance to defocus distance (calibrated model). The overall algorithm flow is shown in Fig. 4(a).

Fig. 4. Non-learning-based Method. (a) The prediction process of non-learning-based method. (b) The workflow of the similarity model. (c) The polynomial curve is fitted to the calibrated model by using the pixel distance and the actual defocus distance.

Download Full Size | PDF

4.1.1 Similarity model

The pixel distance between the centroids of the upper and the lower split-image patterns was estimated by the similarity model [33–35]. However, the similarity cannot be judged through the image’s structure, texture, edge, and other details for defocus images due to the lack of features. Thus, normalized cross correlation (NCC) [32], which regards each pixel as a feature and calculates the correlation between the two feature vectors, was more suitable for this task. NCC can be defined as follows:

(3)$$NCC(x, y) = \frac{\sum_{w \in M} \sum_{h \in N}\left|L(x+w, y+h)-\bar{L}_{x, y}\right||U(w, h)-\bar{U}|}{\sqrt{\sum_{w \in M} \sum_{h \in N}\left[L(x+w, y+h)-\bar{L}_{x, y}\right]^{2} \sum_{w \in M} \sum_{h \in N}[U(w, h)-\bar{U}]^{2}}}$$

(4)$$\bar{L}_{x, y} = \frac{1}{T} \sum_{w \in M} \sum_{h \in N}[L(x+w, y+h)]$$

(5)$$\bar{U} = \frac{1}{T} \sum_{w \in M} \sum_{h \in N} U(w, h)$$

where $M$ and $N$ represent the set of pixel coordinate values selected from the upper split image, $T$ denotes the total number of the set, $L(x,y)$ and $\bar {L}_{x, y}$ are the pixel value and average pixel value of the lower split image, respectively, whereas $U(w, h)$ and $\bar {U}$ signify the pixel value and average pixel value of the upper split image, respectively.

The similarity model is shown in Fig. 4(b). The region where the upper split-image pattern is located was extracted as a template and marked the centroid coordinates of the region as $P_u$. Then, the lower split-image pattern by NCC was located and the centroid was denoted as $P_l$. The pixel distance of the centroids of the upper and lower split-image patterns was calculated according to the formula:

(6)$$\Delta \text{ pixel } = \left(P_{u}-P_{l}\right)_{w}$$

where $P_u$ and $P_l$ refer to the centroid coordinates of the upper and lower split-image patterns, and $*_w$ is the pixel coordinate value along the width direction.

4.1.2 Calibrated model

In non-learning-based method, it is also critical to model the relationship between the pixel distance of the centroids and the actual defocus distance, which connects the pixel distance on a two-dimensional plane with the actual distance in a three-dimensional space. The relationship between two distances is calibrated by polynomial curve fitting. 10 focal stacks with clean background were collected, and the pixel distance and actual defocus distance corresponding to each slice were calculated. The pixel distance is used as the independent variable, and the defocus distance is used as the dependent variable to fit the polynomial curve of each focal stack. Finally, the calibrated model is obtained by averaging all fitted curves. On top of that, the calibration accuracy of the final model was about 0.366$\mathrm {\mu }$m. The polynomial curve of the calibrated model is set as :

(7)$$\begin{aligned}&\Delta z = 1.404 \times 10^{{-}11} \Delta_{\text{pixel }}{ }^{6}+8.425 \times 10^{{-}9} \Delta_{\text{pixel }}{ }^{5}-8.392 \times 10^{{-}8} \Delta_{\text{pixel }}{ }^{4}\\ &\;\;-6.589 \times 10^{{-}5} \Delta_{\text{pixel }}{ }^{3}+4.298 \times 10^{{-}4} \Delta_{\text{pixel }}{ }^{2}+0.9756 \Delta_{\text{pixel }}-0.02759 \end{aligned}$$

4.2 Learning-based method

Learning-based method operate by using an optical defocus model, which establishes a connection between the split-image pattern features from a single image and the defocus distance. This method only needs to input a single defocused split image to predict the defocus distance from the network model. The prediction process is shown in Fig. 5(a). The overall prediction model can be divided into two networks: feature extraction network and feature integration network.

Fig. 5. Learning-based Method. (a) The prediction process of learning-based method. (b) shows the size of the feature map of feature extraction network. (c) shows the size of the feature map of feature integration network.

Download Full Size | PDF

4.2.1 Feature extraction network

Many excellent feature extraction networks exist, such as ResNet [36], MobileNet [37–39], Vision Transformer [40], etc. These networks can fully extract the features of the spot pattern in the split image and represent them with high-level and abstract features, which contain more dimensional information than the low-dimensional image features extracted by the non-learning-based method. And the learning method is used to fully extract the high-dimensional features of the split image under different detection backgrounds, so that it has stronger robustness and is suitable for more detection scenarios.

To accelerate feature extraction to improve autofocus efficiency and reduce memory consumption, depthwise separable convolutional network, which reduces the number of parameters while improving the accuracy and speed of the model, were more suitable for embedded devices where memory and computational resources are limited. The architecture of a depthwise separable convolutional network, which consists of two stages: depthwise convolution and pointwise convolution, has achieved significant success in MobileNet [37–39], GhostNet [41], EfficientNet [42], and so on. Fig. 5(b) lists the size of the feature map when extracting features.

In this method, the input of the feature extraction network is an image of $256 \times 256 \times 3$, and the size of the high-dimensional feature map after five times of down-sampling is $8 \times 8 \times C$. The formula is defined as :

(8)$$T_{8 \times 8 \times C} = E\left(I_{256 \times 256 \times 3}\right)$$

where, $I_{256 \times 256 \times 3}$ represents an image with an input size of $256 \times 256 \times 3$, $E(*)$ denotes a feature extraction network, $T_{8 \times 8 \times C}$ represents a feature map with an output size of $8 \times 8 \times C$, and $C$ is the number of channels of the output feature map.

4.2.2 Feature integration network

Feature integration network is to further process and integrate the features extracted from the previous network to generate the final output. The overall structure change diagram is shown in Fig. 5(c). In this method, the input of the feature integration network is $8 \times 8 \times C$. Firstly, the number of channels is adjusted to $8 \times 8 \times 960$ by $1 \times 1$ convolution, and then the dimension of the feature map is reduced to $1 \times 1 \times 960$ by global average pooling. Finally, the full connection network is used to output the predicted defocus distance. Its formula is defined as:

(9)$$\Delta z = {Int}\left(T_{8 \times 8 \times C}\right)$$

where, $T_{8 \times 8 \times C}$ represents the feature map with input size of $8 \times 8 \times C$, and $C$ is the number of channels. $Int(*)$ denotes the feature integration network. $\Delta z$ represents the predicted defocus distance.

4.3 Proposed method

The learning-based method can extract high-dimensional features from the split image’s spot pattern and integrate them effectively. However, this model simultaneously extracts upper and lower split-image pattern features, leading to the fusion of the two features. This fusion increases the difficulty of the feature integration network in distinguishing between the two features and hinders the improvement of focusing accuracy. Therefore, the SCM was proposed to further improve the accuracy of the predicted defocus distance by extracting the features of the upper and lower split-image patterns separately and comparing their differences.

The structure of SCM is shown in Fig. 6. The split image was divided into two slices, each slice contains only a single split-image pattern, as shown in Fig. 6(a). Two slices are scaled and filled with gray to images of $256 \times 256 \times 3$ as the input of the whole model. SCM mainly includes two networks: feature extraction network and feature comparison network.

Fig. 6. Split-image comparison model (SCM). (a) The prediction process of SCM. (b) shows the size of the feature map of feature comparison network. (c) shows the size of the feature map of feature extraction network.

Download Full Size | PDF

4.3.1 Feature extraction network

The same structure of depthwise separable convolutional network (MobileNet [37–39], GhostNet [41], EfficientNet [42]) as the learning-based method is used to accelerate the extraction of the split-image pattern features of the two slices and reduce memory consumption. Fig. 6(c) lists the feature maps of the split image in the feature extraction process.

In this method, the upper and lower split images undergo feature extraction through weight sharing. The input to the feature extraction network is an image with dimensions of $256 \times 256 \times 3$. Following five rounds of down-sampling, the high-dimensional feature map is reduced to a size of $8 \times 8 \times C$. The number of channels in the feature map is subsequently adjusted, and global average pooling is performed. Finally, a fully connected layer integrates the feature map into a feature vector with dimensions of $1 \times 1 \times 1280$. The formula is defined as:

(10)$$T_{1 \times 1 \times 1280} = E\left(I_{256 \times 256 \times 3}\right)$$

where, $I_{256 \times 256 \times 3}$ represents an image with an input size of $256 \times 256 \times 3$, $E(*)$ denotes a feature extraction network, and $T_{1 \times 1 \times 1280}$ represents a feature vector with an output size of $1 \times 1 \times 1280$.

4.3.2 Feature comparison network

The features of the two split-image patterns are compared. Fig. 6(b) lists the comparative structure of the two split-image pattern features. First, the absolute value of the difference between the two feature vectors of $1 \times 1 \times 1280$ is recorded as :

(11)$${diff}_{1 \times 1 \times 1280} = \left|E\left(I_{256 \times 256 \times 3}^{1}\right)-E\left(I_{256 \times 256 \times 3}^{2}\right)\right|$$

where, $I_{256 \times 256 \times 3}^{1}$ and $I_{256 \times 256 \times 3}^{2}$ are the upper and lower split images respectively, $E(*)$ denotes the feature extraction network, ${diff}_{1 \times 1 \times 1280}$ is the difference of feature vectors.

Then this difference is output to predict the defocus distance after extracting features through three fully connected layers, which is recorded as :

(12)$$\Delta z = Com({diff}_{1 \times 1 \times 1280})$$

where, ${diff}_{1 \times 1 \times 1280}$ is the difference of feature vectors. $Com(*)$ represents three fully connected layers. $\Delta z$ denotes the predicted defocus distance.

5. Experiments

5.1 Training details

The actual training of all models was performed on an NVIDIA GeForce GTX 1650 GPU and implemented in Keras of Tensorflow. The parameters of the Adam optimizer were $\beta _1=0.5$ and $\beta _2=0.999$, and the initial learning rate was $110^{-4}$ and dropped by half every five epochs. All models are trained for a total of $36$ epochs with a batch size of $8$. The loss function of all models can be defined as follows:

(13)$$LOSS = \frac{1}{N} \sum_{i = 1}^{N}\left(\Delta z_{i}^{{true }}-G\left(I_{i}\right)\right)^{2}$$

where $I_i$ is the out-of-focus input image, $\Delta z_{i}^{true}$ denotes the actual defocus distance of the input image, $N$ represents the size of the batch, $G(*)$ is the network model.

The training data of all models are enhanced. The images are randomly flipped horizontally or vertically to weaken the position information of the split-image pattern.

5.2 Comparison of four autofocus methods

The experiment demonstrated that SCM is better than the three state-of-the-art methods: non-learning method, learning method, and learning split-image prediction model (LSPM) [31]. All four methods are single-shot autofocus algorithms. Based on the non-learning method, LSPM restores the defective split image and then predicts the defocus distance. Learning method, LSPM, and SCM all use MobileNetV3 as the feature extraction network, and they predict the defocus distance directly from a single out-of-focus image.

All methods used the same datasets, including the dataset in this paper and the dataset in [31]. The comparative results are presented in Table 1 and Table 2, where the values represent the averages of the testing set, in terms of root mean square error (RMSE) and mean absolute error (MAE). The RMSE and MAE can be defined as follows:

(14)$$RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n}\left(\Delta z_{i}^{ {true }}-\Delta z_{i}^{ {predict }}\right)^{2}}$$

(15)$$MAE = \frac{1}{n} \sum_{i = 1}^{n}\left|\Delta z_{i}^{ {true }}-\Delta z_{i}^{ {predict }}\right|$$

where $n$ is the total number of slices in the test set, $\Delta z_{i}^{true }$ represents the true defocus distance of the slice, and $\Delta z_{i}^{predict }$ denotes the predicted defocus distance of the slice.

Table 1. Results of SCM and baselines on the testing set in this paper

View Table | View all tables in this article

Table 2. Results of SCM and baselines on the testing set in [31]

View Table | View all tables in this article

First of all, the dataset in [31] was used to demonstrate the focusing performance of SCM, and the prediction results were compared with other methods, as shown in Table 2. All methods input images with a size of $512 \times 512 \times 3$ and predict them with CPU. The experiment showed that the focusing accuracy of SCM was higher than that of non-learning method and learning method, and was basically similar to LSPM, but the prediction time of LSPM was significantly higher than that of SCM.

Compared with the dataset in [31], the dataset in this paper has a shallower depth of field and a more complex detection background. The experimental results based on the dataset in this paper can be observed in Table 1. SCM outperformed the other approaches with an RMSE of $0.414$ compared to the closest baseline value of $0.562$, and an MAE of $0.327$ compared to $0.425$. Due to the single model parameter, the non-learning method is easily affected by complex backgrounds and has a larger focusing error than the learning method, which makes the model have higher accuracy and robustness by learning a large number of split images under different backgrounds. In terms of CPU processing time, SCM is slower than the learning method but faster than other methods, as presented in Table 1. All methods input images with a size of $256 \times 256 \times 3$ and predict them with CPU. Thereby, the overall processing time will be longer. In short, SCM is better than the state-of-the-art methods in accuracy and has faster processing time than some methods.

Figure 7 shows the autofocusing performance of four methods for 4 focal stacks which made in this paper. In each focal stack, the scatter plot depicts the predicted defocus distance of each slice, and the focusing errors are calculated by the four methods respectively. At the same time, the split images in different states and their corresponding natural image are displayed. The results clearly demonstrated that SCM is more stable in any scene. Non-learning-based method is affected by the detection background, so that the focusing accuracy of each focal stack is quite different. Learning-based method can adapt to different detection backgrounds, and SCM has higher focusing accuracy and robustness than the other two learning methods.

Fig. 7. Comparison of autofocusing performance of SCM with others for 4 focal stacks. In each focal stack, the scatter plot depicts the predicted defocus distance of 161 slices, and the focusing errors are calculated by the four methods respectively. The split images in different states and their corresponding natural image are displayed on the graph.

Download Full Size | PDF

5.3 Comparison of learning method and SCM

In addition, to further demonstrate the superiority of the structure of SCM, six different lightweight networks were used as the feature extraction networks of learning method and SCM respectively to calculate the focusing accuracy of each model on the test set. The results are presented in Table 3. The model parameters of SCM and Learning Method based on different feature extraction networks were equal, but the focusing accuracy of SCM was better than that of Learning Method. Therefore, SCM exhibits higher focusing performance.

Table 3. Comparison of Learning Method and SCM based on different feature extraction networks

View Table | View all tables in this article

5.4 Comparison of natural-image-based method and split-image-based method

To demonstrate that the split-image-based method has better focusing performance than the natural-image-based method, a natural-image dataset was made, where the natural image corresponds to the split image one by one, and the reference position of each focal stack of the natural image was adjusted by the sharpness metric. The natural-image-based method [21], which predicts the defocus distance by the regression model from a natural image rather than a split image, is similar to the learning-based method mentioned in this paper. The two split-image-based methods correspond to 4.2 and 4.3, respectively. All three methods used MobileNetV3 as the feature extraction network. The split-image dataset and the natural-image dataset were divided into training, validation, and testing sets using the same grouping format. The experimental results showed that the split-image-based method has better overall focusing accuracy and robustness than the natural-image-based method, as shown in Table 4, and the focusing accuracy of the SCM proposed for split-image is $44{\% }$ higher than that of the natural-image method.

Table 4. Comparison of natural-image-based method and split-image-based methods

View Table | View all tables in this article

5.5 Prediction process of SCM

Figure 8 displays the process of predicting the defocus distance from an out-off-focus split image and the focusing error for 1 focal stack. The ROI was extracted from an arbitrary defocus image, and then was divided into two slices. Two slices were simultaneously input into the network for feature extraction and comparison to predict the defocus distance (Fig. 8(a)). SCM achieves autofocusing by means of laser active illumination and the natural images corresponding to the split images in the focusing process are shown in Fig. 8(a). The focusing error for 1 focal stack was irregularly distributed, as shown in Fig. 8(c).

Fig. 8. SCM for 1 focal stack. (a) The process of predicting the defocus distance from an out-off-focus split image. And the change of natural images are shown on the right. (b) shows more prediction results of split images. (c) The focusing error of SCM is plotted as a function of the axial defocus distance.

Download Full Size | PDF

6. Conclusion

In this work, a high-precision autofocus method with laser illumination was proposed, which predicts the defocus distance from a single split image. The common non-learning-based and learning-based methods for LSA were quantitatively analyzed and evaluated. To further improve the focusing accuracy from a single defocus image, a lightweight comparative framework model for LSA was build. Compared with state-of-the-art autofocus methods, SCM had higher focusing accuracy and stronger robustness to resist complex detection backgrounds interference. It was also demonstrated that LSA has better focusing performance than the natural-image-based method. Note that LSA works with split images instead of natural images, and thus it can be applied to many other fields.

Funding

National Natural Science Foundation of China (51975344, 62176149).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Dataset 1, Ref. [43]

References

1. X. Zhang, F. Zeng, Y. Li, and Y. Qiao, “Improvement in focusing accuracy of dna sequencing microscope with multi-position laser differential confocal autofocus method,” Opt. Express 26(2), 887–896 (2018). [CrossRef]

2. K. Ji, P. Zhao, C. Zhuo, M. Chen, J. Chen, H. Jin, S. Ye, and J. Fu, “Ultrasonic autofocus imaging of internal voids in multilayer polymer composite structures,” Ultrasonics 120, 106657 (2022). [CrossRef]

3. M. C. Montalto, R. R. McKay, and R. J. Filkins, “Autofocus methods of whole slide imaging systems and the introduction of a second-generation independent dual anning method,” J. Pathology Inf. 2(1), 44 (2011). [CrossRef]

4. K. Guo, J. Liao, Z. Bian, X. Heng, and G. Zheng, “Instantscope: a low-cost whole slide imaging system with instant focal plane detection,” Biomed. Opt. Express 6(9), 3210–3216 (2015). [CrossRef]

5. Z. Bian, C. Guo, S. Jiang, J. Zhu, R. Wang, P. Song, Z. Zhang, K. Hoshino, and G. Zheng, “Autofocusing technologies for whole slide imaging and automated microscopy,” J. Biophotonics 13(12), e202000227 (2020). [CrossRef]

6. S. Pertuz, D. Puig, and M. A. Garcia, “Analysis of focus measure operators for shape-from-focus,” Pattern Recognit. 46(5), 1415–1432 (2013). [CrossRef]

7. A. Santos, C. Ortiz de Solórzano, J. J. Vaquero, J. M. Pena, N. Malpica, and F. del Pozo, “Evaluation of autofocus functions in molecular cytogenetic analysis,” J. Microsc. 188(3), 264–272 (1997). [CrossRef]

8. L. Firestone, K. Cook, K. Culp, N. Talsania, and K. Preston Jr, “Comparison of autofocus methods for automated microscopy,” Cytom. The J. Int. Soc. for Anal. Cytol. 12(3), 195–206 (1991). [CrossRef]

9. S.-Y. Lee, J.-T. Yoo, Y. Kumar, and S.-W. Kim, “Reduced energy-ratio measure for robust autofocusing in digital camera,” IEEE Signal Process. Lett. 16(2), 133–136 (2009). [CrossRef]

10. Y. Sun, S. Duthaler, and B. J. Nelson, “Autofocusing in computer microscopy: selecting the optimal focus algorithm,” Microsc. Res. Tech. 65(3), 139–149 (2004). [CrossRef]

11. S. Jiao, P. W. M. Tsang, T.-C. Poon, J.-P. Liu, W. Zou, and X. Li, “Enhanced autofocusing in optical scanning holography based on hologram decomposition,” IEEE Trans. Ind. Inf. 13(5), 2455–2463 (2017). [CrossRef]

12. Z. Ren, E. Y. Lam, and J. Zhao, “Acceleration of autofocusing with improved edge extraction using structure tensor and schatten norm,” Opt. Express 28(10), 14712–14728 (2020). [CrossRef]

13. J. He, R. Zhou, and Z. Hong, “Modified fast climbing search auto-focus algorithm with adaptive step size searching technique for digital camera,” IEEE Trans. Consumer Electron. 49(2), 257–262 (2003). [CrossRef]

14. N. Kehtarnavaz and H.-J. Oh, “Development and real-time implementation of a rule-based auto-focus algorithm,” Real-Time Imaging 9(3), 197–203 (2003). [CrossRef]

15. Z. Wu, D. Wang, and F. Zhou, “Bilateral prediction and intersection calculation autofocus method for automated microscopy,” J. Microsc. 248(3), 271–280 (2012). [CrossRef]

16. C. Herrmann, R. S. Bowen, N. Wadhwa, R. Garg, Q. He, J. T. Barron, and R. Zabih, “Learning to autofocus,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 2230–2239.

17. C. Li, A. Moatti, X. Zhang, H. T. Ghashghaei, and A. Greenbaum, “Deep learning-based autofocus method enhances image quality in light-sheet fluorescence microscopy,” Biomed. Opt. Express 12(8), 5214–5226 (2021). [CrossRef]

18. L. Wei and E. Roberts, “Neural network control of focal position during time-lapse microscopy of cells,” Sci. Rep. 8(1), 7313 (2018). [CrossRef]

19. T. Pitkäaho, A. Manninen, and T. J. Naughton, “Focus prediction in digital holographic microscopy using deep convolutional neural networks,” Appl. Opt. 58(5), A202–A208 (2019). [CrossRef]

20. Y. Xiang, Z. He, Q. Liu, J. Chen, and Y. Liang, “Autofocus of whole slide imaging based on convolution and recurrent neural networks,” Ultramicroscopy 220, 113146 (2021). [CrossRef]

21. J. Liao, X. Chen, G. Ding, P. Dong, H. Ye, H. Wang, Y. Zhang, and J. Yao, “Deep learning-based single-shot autofocus method for digital microscopy,” Biomed. Opt. Express 13(1), 314–327 (2022). [CrossRef]

22. C. Wang, Q. Huang, M. Cheng, Z. Ma, and D. J. Brady, “Deep learning for camera autofocus,” IEEE Trans. Comput. Imaging 7, 258–271 (2021). [CrossRef]

23. T. R. Dastidar and R. Ethirajan, “Whole slide imaging system using deep learning-based automated focusing,” Biomed. Opt. Express 11(1), 480–491 (2020). [CrossRef]

24. A. Shajkofci and M. Liebling, “Spatially-variant cnn-based point spread function estimation for blind deconvolution and depth estimation in optical microscopy,” IEEE Trans. on Image Process. 29, 5848–5861 (2020). [CrossRef]

25. H. Pinkard, Z. Phillips, A. Babakhani, D. A. Fletcher, and L. Waller, “Deep learning for single-shot autofocus microscopy,” Optica 6(6), 794–797 (2019). [CrossRef]

26. S. Jiang, J. Liao, Z. Bian, K. Guo, Y. Zhang, and G. Zheng, “Transform-and multi-domain deep learning for single-frame rapid autofocusing in whole slide imaging,” Biomed. Opt. Express 9(4), 1601–1612 (2018). [CrossRef]

27. S. Cheon, H. Lee, C. O. Kim, and S. H. Lee, “Convolutional neural network for wafer surface defect classification and the detection of unknown defect class,” IEEE Trans. Semicond. Manufact. 32(2), 163–170 (2019). [CrossRef]

28. H.-I. Lin and P. Menendez, “Image denoising of printed circuit boards using conditional generative adversarial network,” in 2019 IEEE 10th International Conference on Mechanical and Intelligent Manufacturing Technologies (ICMIMT), (IEEE, 2019), pp. 98–103.

29. C.-S. Liu and H.-D. Tu, “Innovative image processing method to improve autofocusing accuracy,” Sensors 22(13), 5058 (2022). [CrossRef]

30. X. Zhang, F. Fan, M. Gheisari, and G. Srivastava, “A novel auto-focus method for image processing using laser triangulation,” IEEE Access 7, 64837–64843 (2019). [CrossRef]

31. Z. Hua, X. Zhang, D. Tu, X. Wang, and N. Huang, “Learning to high-performance autofocus microscopy with laser illumination,” Measurement 216, 112964 (2023). [CrossRef]

32. D. A. Kerr, “Principle of the split image focusing aid and the phase comparison autofocus detector in single lens reflex cameras,” (2005).

33. W. K. Pratt, “Correlation techniques of image registration,” IEEE Trans. Aerosp. Electron. Syst. AES-10(3), 353–358 (1974). [CrossRef]

34. P. Viola and W. M. Wells III, “Alignment by maximization of mutual information,” Int. J. Comp. Vis. 24(2), 137–154 (1997). [CrossRef]

35. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]

36. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 770–778.

37. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 4510–4520.

38. A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, and V. Vasudevan, “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF international conference on computer vision, (2019), pp. 1314–1324.

39. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXivarXiv:1704.04861 (2017). [CrossRef]

40. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, and S. Gelly, “An image is worth 16×16 words: Transformers for image recognition at scale,” arXivarXiv:2010.11929 (2020). [CrossRef]

41. K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “Ghostnet: More features from cheap operations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 1580–1589.

42. M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning, (PMLR, 2019), pp. 6105–6114.

43. Z. Hua, “Chip and wafer datasets,” figshare, (2023), https://doi.org/10.6084/m9.figshare.23909595.v1.

Algorithm	Error ( $μ$ m)		CPU Time (ms)
Algorithm	RMSE	MAE	CPU Time (ms)
Non-learning method	1.464	0.969	124
Learning method	0.680	0.461	69
LSPM	0.562	0.425	208
SCM	0.414	0.327	92

Algorithm	RMSE ( $μ$ m)			MAE ( $μ$ m)			CPU Time (ms)
Algorithm	$\pm$ 200	$\pm$ 300	$\pm$ 480	$\pm$ 200	$\pm$ 300	$\pm$ 480	CPU Time (ms)
Non-learning Method	2.421	3.509	5.444	1.978	2.736	3.938	173
Learning Method	1.612	1.904	2.597	1.283	1.580	2.166	141
LSPM	1.185	1.473	2.073	0.945	1.155	1.505	329
SCM	1.164	1.385	2.098	0.935	1.027	1.812	193

Feature Extraction Network	Learning Method Error ( $μ$ m)		Learning Method Parameter	SCM Error ( $μ$ m)		SCM Parameter
Feature Extraction Network	RMSE	MAE	Learning Method Parameter	RMSE	MAE	SCM Parameter
EfficientNet_B0 [42]	0.791	0.572	7.0 M	0.526	0.382	7.0 M
EfficientNet_B1 [42]	0.766	0.555	9.5 M	0.437	0.348	9.5 M
GhostNet [41]	0.732	0.535	5.8 M	0.435	0.339	5.8 M
MobileNet_V1 [39]	0.744	0.526	7.3 M	0.481	0.356	7.3 M
MobileNet_V2 [37]	0.773	0.538	5.2 M	0.428	0.342	5.2 M
MobileNet_V3 [38]	0.680	0.461	6.2 M	0.414	0.327	6.2 M

Algorithm	Error ( $μ$ m)
Algorithm	RMSE	MAE
Natural-image-based Method + Learning Method	0.776	0.589
Split-image-based Method + Learning Method	0.680	0.461
Split-image-based Method + SCM	0.414	0.327

Algorithm	Error ( $μ$ m)		CPU Time (ms)
Algorithm	RMSE	MAE	CPU Time (ms)
Non-learning method	1.464	0.969	124
Learning method	0.680	0.461	69
LSPM	0.562	0.425	208
SCM	0.414	0.327	92

Autofocus methods based on laser illumination

Abstract

1. Introduction

2. Principle of the split-image autofocusing

3. Dataset

4. Autofocusing methods of split image

4.1 Non-learning-based method

4.1.1 Similarity model

4.1.2 Calibrated model

4.2 Learning-based method

4.2.1 Feature extraction network

4.2.2 Feature integration network

4.3 Proposed method

4.3.1 Feature extraction network

4.3.2 Feature comparison network

5. Experiments

5.1 Training details

5.2 Comparison of four autofocus methods

5.3 Comparison of learning method and SCM

5.4 Comparison of natural-image-based method and split-image-based method

5.5 Prediction process of SCM

6. Conclusion

Funding

Disclosures

Data availability

References

Supplementary Material (1)

Data availability

Cited By

Figures (8)

Tables (4)

Equations (15)

Optics Express