
Deep neural networks for single shot structured light profilometry

Open Access

Abstract

In 3D optical metrology, single-shot structured light profilometry techniques have inherent advantages over their multi-shot counterparts in terms of measurement speed, optical setup simplicity, and robustness to motion artifacts. In this paper, we present a new approach, based entirely on deep learning, to extract height information from single deformed fringe patterns. By training a fully convolutional neural network on a large set of simulated height maps with corresponding deformed fringe patterns, we demonstrate the ability of the network to obtain full-field height information from previously unseen fringe patterns with high accuracy. As an added benefit, intermediate data processing steps such as background masking, noise reduction and phase unwrapping, which are otherwise required in classic demodulation strategies, can be learned directly by the network as part of its mapping function.

© 2019 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Structured light profilometry (SLP) or active illumination techniques illuminate the surface of the measurement object with predefined spatially varying intensity patterns and record these patterns as they are deformed by the object's shape when observed under an angle. The general setup of an SLP system is illustrated in Fig. 1. Digitization of this basic projector-camera configuration has enabled numerous specific SLP techniques to be developed, each with its own strengths and weaknesses [1,2]. Though they are all unique in their specific implementation, they can be differentiated primarily by whether they require multiple projected patterns per 3D measurement (multi-shot versus single-shot techniques) and by whether they are color-dependent.


Fig. 1 Standard projector-camera configuration used in structured light profilometry techniques. A structured light modulator projects the pattern onto the scene and an imaging sensor placed at a relative angle with the projection axis records the deformed pattern. The height (z)-sensitivity vector is oriented along the bisector between the projection and observation axes.


The recorded intensity map $I_R$ can be expressed in terms of the background illumination $I_B$, the intensity modulation $I_M$ and the surface phase map $\varphi(x,y)$:

$$I_R(x,y) = I_B(x,y) + I_M(x,y)\cos\big(\varphi(x,y)\big). \tag{1}$$
Therefore, a minimum of three distinct intensity maps is analytically required to uniquely define $\varphi(x,y)$ and thereby to extract the height map of the object surface. To this end, so-called n-step phase shifting techniques shift the phase of the projected pattern by integer multiples of 2π / n (with n ≥ 3) between successive recordings of $I_R$. Here, a trade-off is typically made between speed and accuracy: increasing the number of phase shifts per 3D measurement improves the measurement resolution, whereas a smaller n reduces the acquisition time per 3D recording and allows shape measurements at higher speeds.
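For reference, the sketch below shows how such an n-step demodulation recovers the wrapped phase; this is the classic baseline our method replaces, not part of the proposed network. It assumes equally spaced phase shifts of 2π/n and uses the standard least-squares estimator.

```python
import numpy as np

def demodulate_phase(images):
    """Wrapped phase from n >= 3 recordings I_k = I_B + I_M cos(phi + 2*pi*k/n).

    images: array of shape (n, H, W), where image k was recorded with the
    projected pattern shifted by 2*pi*k/n. Returns phi wrapped to (-pi, pi].
    """
    n = images.shape[0]
    delta = 2 * np.pi * np.arange(n).reshape(-1, 1, 1) / n
    # For equally spaced shifts: sum_k I_k sin(delta_k) = -(n/2) I_M sin(phi)
    # and sum_k I_k cos(delta_k) = (n/2) I_M cos(phi), hence:
    return np.arctan2(-np.sum(images * np.sin(delta), axis=0),
                      np.sum(images * np.cos(delta), axis=0))
```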

Although state-of-the-art 3-step phase shifting profilometry systems are able to obtain high-resolution object surface maps in real time [3], multi-shot setups have some fundamental limitations compared to single-shot techniques. Because multiple fringe patterns must be recorded per measurement, the 3D frame rate of multi-shot techniques is limited to a fraction of the maximum camera frame rate, and some degree of motion artifacts is inevitable when measuring dynamically moving objects. In addition, a dedicated synchronization engine between projector and camera is needed, which requires triggering hardware and onboard projector memory to display the successive fringe patterns in a loop, complicating the technical setup.

To avoid these drawbacks, a number of single-shot profilometry techniques have been developed over the years. First, multiple phase-shifted patterns can be superimposed onto a single color image [4]. This color pattern is projected onto the measurement surface and recorded by a color-sensitive CCD sensor; by separating the color channels in post-processing, the distinct phase-shifted fringe patterns are extracted and the object height map can be reconstructed using Eq. (1). Second, color can be used directly to landmark each point on the object surface with a specific wavelength by illuminating it with a predefined rainbow pattern [5]. This establishes a one-to-one correspondence between the projection angle and a particular wavelength, from which the surface height map can be triangulated. The disadvantage of color-based techniques, however, is that they assume a color-independent reflectivity profile of the object, and that the quality of the resulting depth map is greatly influenced by the amount of cross-talk between color channels [6]. This sensitivity can be alleviated by bucketing the wavelengths into discrete levels, but at the cost of in-plane resolution [7]. Alternatively, grayscale indexing methods are single-shot techniques that project a series of dots, stripes or patterns onto the target and employ pattern search algorithms to solve the correspondence problem [8]. These methods are color-independent and robust, but perform significantly worse in terms of in-plane and depth resolution than other techniques. Another subset of grayscale single-shot SLP techniques is the group of Fourier transform profilometry (FTP) techniques [9]. FTP techniques typically employ standard sinusoidal fringe patterns, making them less sensitive to image defocus than indexing methods. On the other hand, a successful FTP reconstruction depends heavily on accurately selecting only the originally projected sine-wave frequencies in Fourier space. As the carrier frequency is stretched into a principal frequency band by the slope of the object surface, it inevitably overlaps with other frequency components present in the image and cannot be isolated in Fourier space unambiguously.

Recently, machine learning techniques have replaced task-specific algorithms in a wide range of applications. More specifically, deep learning (DL) has been successfully applied to fields including computer vision [10], bioinformatics [11], speech recognition [12], natural language processing [13] and even board games [14], thanks to advances in hardware (GPU-based parallel programming), software techniques (weight initialization and backpropagation strategies) and increased data availability. Deep neural networks (DNNs) consist of a hierarchy of layers, whereby each layer transforms the output of the previous layer into more abstract representations or features of the input data. The more layers a network has, the more complex the features it can learn and the higher the level of abstraction that can be obtained. These features are ultimately combined in the output layer of the network to make a prediction or decision based on the data presented to the input layer. Typically, nonlinear activation functions are used between successive layers of DNNs; this increases the network's predictive capabilities but obscures human intuition into how exactly a prediction is made. The weight matrices that compose the hidden layers are updated using backpropagation algorithms during an iterative process called network training. Depending on the complexity of the mapping function f that is to be emulated, a rather large set of known input-output couples, or training data, is typically required to approximate f with acceptable accuracy.

Convolutional neural networks (ConvNets or CNNs) are a subset of deep neural networks most commonly used in image analysis tasks involving object recognition and classification. Inspired by the workings of the visual cortex in animals, convolutional neurons or filters apply a convolution operation to only a limited area of the previous layer, called the receptive field. Each filter scans the entire previous layer and adjusts its weights during training to maximally reflect the patterns present in the input data. In lower-level layers, these filters are sensitive to simple patterns such as lines, curves and edges, whereas in higher-level layers they come to represent more abstract and complex features. The final layer of the network then combines these features either into a fully connected layer that selects one of a discrete set of options (classification algorithms) or into a final convolutional layer that outputs an N-dimensional array of continuous output variables (regression algorithms).

In this paper, we construct a CNN that predicts the 3D height map of an object from only a single input fringe pattern that is deformed by the object shape when observed under a known angle. Formally, we approximate a mapping function f: X → Y that transforms a 2D array of grayscale values (X) into a continuous 3D height distribution (Y) without any additional intermediate processing steps. We train the network on a large set of simulated height map–fringe pattern couples and demonstrate, as a proof of principle, that neural networks can be used to demodulate fringe patterns in single-shot profilometry systems. Although deep learning techniques have been reported to replace subroutines in phase demodulation algorithms (frequency selection automation in FTP [15], period order detection [16] or background and fringe modulation separation [17] in PSP), this is, to the best of our knowledge, the first time that DL is used to replace the phase demodulation process entirely as an end-to-end solution. The main advantage of this new method is that noise reduction, background masking, phase unwrapping and any other optional or mandatory intermediate data processing steps can be bundled together in a single mapping function.

2. Methods

2.1 Training data set

To train the network, a large set of simulated height maps with corresponding deformed fringe patterns is needed. To automate the construction of such a training data set, a random surface map generator was designed that accepts input parameters such as volumetric boundary limits, oscillation frequency, number of peaks and troughs (Np), and sharpness or smoothness of the edges. Within predefined limits, the surface map generator randomly varies these input parameters and produces a fixed number of surface maps of a selected pixel size. A sample of the height maps in the training data set is shown in Fig. 2.


Fig. 2 Random selection of simulated height maps from the training data set, ordered as a function of the number of peaks (Np) and sampled on grids of 128 × 128 pixels.

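The exact interpolation rules of our generator are described only qualitatively above; the following simplified stand-in (our own sketch, not the published generator) illustrates how random height maps with a given Np and smooth or sharp edges can be produced:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def random_height_map(size=128, n_peaks=5, z_max=32.0, smooth=True, rng=None):
    rng = np.random.default_rng(rng)
    if n_peaks == 0:                        # flat plane of random height
        return np.full((size, size), rng.uniform(0.0, z_max))
    z = np.zeros((size, size))
    rows, cols = rng.integers(0, size, (2, n_peaks))
    z[rows, cols] = rng.uniform(-1.0, 1.0, n_peaks)   # random peaks/troughs
    # Interpolate between the seeds: a wide Gaussian blur gives smooth
    # surfaces, a narrow one gives sharper transitions.
    z = gaussian_filter(z, sigma=size / (2.0 * n_peaks) if smooth else 3.0)
    z -= z.min()
    return z / max(z.max(), 1e-9) * rng.uniform(0.5, 1.0) * z_max
```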

Next, a predefined fringe pattern with sinusoidal intensity modulation is virtually projected onto the 3D height maps, after which the surface texture is observed under a fixed angle α. Using straightforward triangulation, the modulated fringe pattern is sampled on a Cartesian grid of fixed size. This process is illustrated in Fig. 3.


Fig. 3 Fringe modulation process. A predefined fringe pattern (b) is projected onto a randomly generated height map (a), after which the surface texture (c) is observed under an angle α (here α = 30°) and the deformed fringe pattern (d) is sampled onto a Cartesian grid.


In order to mimic real structured light profilometry setups, realistic values for the fringe pattern frequency (6 periods per image width) and the angle between the projection and observation axes (α = 30°) were chosen when generating the training data set. Deformed fringe patterns and height maps were each sampled on grids of 128 × 128 pixels, and height values were confined to z ∈ [0, 32]. Note that the maximum z-value of 32 need not be attained in every surface map; it is simply the upper limit. The 4:1 ratio between image width and maximum surface height reduces the amount of shadowing that may arise during the fringe modulation process. Nevertheless, an explicit check was included in pre-processing to ensure that the local gradient did not exceed the angle between the projection and observation axes, so that no regions with shadows or invalid data were present in the training data set. After fringe modulation, the volumetric boundary limits of the surface maps in the data set were normalized to x, y, z ∈ [0, 1].
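A minimal sketch of this modulation step, using the parameters stated above (six fringe periods per image width, α = 30°, heights in pixels). The height-proportional lateral shift is the standard triangulation model consistent with Eq. (1), and the gradient test mirrors the shadow check described in the text:

```python
import numpy as np

def deformed_fringe(z, periods=6, alpha_deg=30.0):
    """Fringe pattern observed under angle alpha for height map z; cf. Eq. (1)."""
    h, w = z.shape
    x = np.broadcast_to(np.arange(w, dtype=float), (h, w))
    # A surface point of height z appears displaced by z*tan(alpha) along x.
    phase = 2 * np.pi * periods * (x + z * np.tan(np.radians(alpha_deg))) / w
    return 0.5 + 0.5 * np.cos(phase)        # I_B = I_M = 0.5

def casts_shadow(z, alpha_deg=30.0):
    """Reject height maps whose local slope exceeds the angle alpha."""
    return np.any(np.abs(np.diff(z, axis=1)) > np.tan(np.radians(alpha_deg)))
```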

2.2 Network architecture

The architecture of our deep neural network is outlined in Fig. 4.


Fig. 4 Our network structure. A total of 10 convolutional layers separate the input and output layer. Nonlinear activation (ReLU) and dropout layers are included after every convolutional layer but are not shown here.


It contains an image input layer, 10 convolutional layers, and an output layer. After each convolutional layer, a rectified linear unit (ReLU) provides nonlinear mapping and a dropout layer increases regularization. Except for the first and the final layer, all convolutional layers are of the same size: 64 filters of size 5 × 5 × 64, where each filter operates on 5 × 5 spatial regions across 64 channels (or feature maps). The first layer operates on the image input layer and consists of 64 filters of size 5 × 5; the final layer consists of a single filter of size 5 × 5 × 64. In total, 925,441 learnable parameters are present across all layers of the model. The exclusive use of convolutional layers is inspired by recent work in single image super-resolution (SISR) applications and results in an efficient model that can be applied to images of different input sizes [18]. In addition, network inference can be significantly sped up, since convolutional operations can be implemented highly efficiently on parallel hardware [19].
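A hedged Keras reconstruction of this architecture is given below; the 'same' padding and unit stride are our assumptions, chosen so that input and output sizes match and the parameter count reproduces the reported 925,441.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(dropout_rate=0.5):
    inputs = tf.keras.Input(shape=(None, None, 1))    # fully convolutional
    x = inputs
    for _ in range(10):                               # 10 conv layers, 64 filters of 5 x 5
        x = layers.Conv2D(64, 5, padding="same", activation="relu")(x)
        x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Conv2D(1, 5, padding="same")(x)  # single 5 x 5 x 64 output filter
    return tf.keras.Model(inputs, outputs)

# build_model().count_params() == 925441, matching the text.
```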

2.3 Training the network

A data set of N = 12500 simulated input-output couples was created using the procedure described above. The number of peaks and troughs Np present in each height map varied randomly between 0 (flat planes of random height) and 19 (highly fluctuating landscapes). Either Gaussian or linear curve fitting was used to interpolate between peaks and troughs, leading to surface maps with smooth or sharp edges, respectively. A total of Ntrain = 10000 data couples were used to train the network and Nval = 2500 couples were used for validation.

Our deep learning profilometry network was trained using the TensorFlow framework [20] with a batch size of 32 for 500,000 iterations, or roughly 1600 epochs over the entire training data set. Adam optimization [21] was used with a learning rate of 1 × 10−4. Before training, the filter weights were set using Xavier initialization [22], which makes the variance of a layer's outputs equal to that of its inputs. To reduce overfitting, both dropout and weight decay (L2) were used for regularization. Training started with dropout at 50% and L2 regularization at 10−3; after every 100,000 iterations, the dropout rate was halved, the learning rate was divided by 5 and the L2 penalty was divided by 10. During the fifth and final set of iterations, both the dropout rate and the L2 penalty were set to 0. Root mean square error (RMSE) was employed as the cost function to train the network. Our implementation uses cuDNN and was optimized for parallel execution; in total, training took roughly 120 hours on a single Titan X Pascal GPU.
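The staged schedule can be summarized as follows. The RMSE loss and the stage values are taken from the text; the Keras-style loop and the way weights are carried between stages are our assumptions.

```python
import tensorflow as tf

def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

# (dropout rate, learning rate, L2 penalty); each stage runs 100,000 iterations.
stages = [(0.5, 1e-4, 1e-3), (0.25, 2e-5, 1e-4), (0.125, 4e-6, 1e-5),
          (0.0625, 8e-7, 1e-6), (0.0, 1.6e-7, 0.0)]

model = build_model(dropout_rate=stages[0][0])        # from the sketch above
for dropout, lr, l2 in stages:
    # In practice the dropout rates and L2 kernel regularizers of the layers
    # are updated in place so that the learned weights carry over.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss=rmse)
    # model.fit(train_ds, ...)   # run ~100,000 iterations at batch size 32
```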

3. Results

After training, the validation RMSE per pixel converged to 0.0069 on the validation set of 2500 data couples. This means that, on average, the network extracts height maps from previously unseen deformed fringe patterns to within an error margin of < 0.7% of the maximum height range per data point. To illustrate the predictive capabilities of the network, three deformed fringe patterns were randomly selected from the validation set and are presented in Fig. 5, together with their respective ground truth height maps and network predictions. The network prediction maps are nearly indistinguishable from the ground truth height maps. The error maps, which represent the difference between ground truth and network prediction, suggest a fairly smooth distribution of the prediction error without any clear systematic errors, with worst-case errors below 9%.


Fig. 5 Network inference on samples drawn randomly from the validation set. The first two columns represent the deformed fringe pattern – surface map data couples, the third column shows network prediction and the fourth column includes the error map. The numbers of peaks Np present in the height maps are 4, 6 and 9 for the first, second and third samples, respectively. X- and Y-coordinates are displayed as pixel numbers on a 128 × 128 grid; Z-values are normalized to [0, 1].

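The reported accuracy figures correspond to the following per-pixel statistics on the [0, 1]-normalized height maps (our reading of the metrics):

```python
import numpy as np

def error_stats(y_true, y_pred):
    err = y_pred - y_true
    return {"rmse_per_pixel": float(np.sqrt(np.mean(err ** 2))),  # e.g. 0.0069
            "worst_case": float(np.max(np.abs(err)))}             # e.g. < 0.09
```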

Since the data couples in the validation set were created using the same random height map generator as the training data set, it is also interesting to assess network performance on fringe patterns modulated from independently created height maps. To this end, a small sample of 3D figures was constructed and rescaled to the volumetric boundaries accepted by the fringe modulation procedure, and the resulting deformed fringe patterns were used as input to the network. The results are presented in Fig. 6.


Fig. 6 Network inference on samples created independently from the random surface generator. The first two columns represent the deformed fringe pattern – surface map data couples, the third column shows network prediction and the fourth column includes the error map. The upper part of a sphere, a triangular step function, and a mannequin doll head are included in row 1, row 2 and row 3, respectively. X- and Y-coordinates are displayed as pixel numbers on a 128 × 128 grid; Z-values are normalized to [0, 1].


First, the upper part of a sphere with a radius of 150 pixels was cut off so that its total height was 32 pixels, or one-fourth of the image width. Apart from a small indentation at the top of the sphere, the network's height map prediction corresponds well with the ground truth map. The average RMSE/pixel is < 0.8% and the worst-case error is < 7%.

Second, a triangular step function with a period of one-third of the image width and a maximum height of 32 pixels was generated. Again, the network reconstructs the 3D height map from the input fringe pattern rather well, although this time some periodicity can be noticed in the error map along the lines connecting the upper and lower boundaries of the step function in the X-dimension. The average RMSE/pixel is < 0.5% and the worst-case error is < 2%.

Finally, a cropped section of a mannequin doll head measurement was rescaled to fit the maximum volumetric boundaries of the modulation procedure. The ground truth 3D surface map was gathered from standard 4-step phase shifting profilometry measurements, and parts of the 3D figure where shadows or regions of invalid data corrupted the measurement were filled in with standard interpolation techniques in post-processing. The network assesses the overall 3D shape of the head well, but fails in parts of the image near the eyes, nose and mouth, where the phase shifts quickly. The average RMSE/pixel is < 1.1% and the worst-case error is < 13%.

To investigate the effect of noise on network inference accuracy, we added different levels of Gaussian noise to an input sample from the validation set. The results are shown in Fig. 7. Low levels of noise (σ = 0.001 to σ = 0.01) have little to no effect on the output prediction error (the RMSE/pixel remains well within 1%, with a worst-case error < 12%). Heavier noise (σ = 0.1) does reduce the accuracy of the network prediction (RMSE/pixel above 3%, with a worst-case error < 24%), but does not inhibit the network's ability to reconstruct the general form of the 3D height map from the noisy image.


Fig. 7 Network inference on samples with varying levels of noise. Gaussian noise with zero mean (µ = 0) and a noise level ranging from σ = 0 (top row) to σ = 0.1 (bottom row) was added to a sample drawn from the validation set before network inference. X- and Y-coordinates are displayed as pixel numbers on a 128 × 128 grid; Z-values are normalized to [0, 1].

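The noise experiment can be reproduced along these lines; fringe, height and model refer to the earlier sketches, and treating σ as the standard deviation of the added noise is our assumption.

```python
import numpy as np

for sigma in (0.0, 0.001, 0.01, 0.1):
    noisy = np.clip(fringe + np.random.normal(0.0, sigma, fringe.shape), 0.0, 1.0)
    pred = model.predict(noisy[None, ..., None])[0, ..., 0]
    print(sigma, error_stats(height, pred))
```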

It should be noted that although training the network can take several hours to days, inference is much faster. A forward pass through the network consists mainly of matrix multiplications and convolutions, which can be implemented efficiently on parallel hardware such as FPGAs and GPUs. Inference on a single 128 × 128-pixel input fringe pattern takes under 2 ms on a Titan X GPU.
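As an illustration, a single forward pass can be timed as follows (the ~2 ms figure is hardware-dependent; model refers to the sketch in Section 2.2):

```python
import time
import numpy as np

batch = np.zeros((1, 128, 128, 1), dtype=np.float32)
model.predict(batch)                       # warm-up: graph build and transfers
t0 = time.perf_counter()
model.predict(batch)
print(f"{(time.perf_counter() - t0) * 1e3:.2f} ms per inference")
```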

4. Discussion

4.1 Training data set

The main purpose of this paper is to provide a proof of principle that deep learning networks can be used to extract height information directly from single-shot input fringe patterns. The exclusive use of simulated height maps in the construction of our training data set does not detract from this objective; rather, it allowed us to quickly generate an arbitrarily large set of representative data couples that adhere to certain boundary restrictions. The simulated height maps used to train the network contained a number of peaks and troughs Np, randomly set between 0 and 19. Setting the maximum arbitrarily at Np = 19 generated 20 different classes with varying degrees of surface oscillation and provided the training set with enough local surface diversity to reconstruct height maps of realistic measurement targets with high accuracy. Drawing Np with equal probabilities for all classes may be suboptimal, though: the indentation at the top of the spherical surface measurement may indicate that the network imposed a higher-order frequency on the 3D figure, suggesting that figures with smaller Np values should have been represented more strongly in the training data set. The periodicity in the error map of the triangular step function suggests another possible modification: since interpolation occurs between peaks and troughs of random heights, the chance of two adjacent points having identical or similar z-values is rather low, so longer straight edges in particular are currently underrepresented in the training data set. This could be corrected by including a small set of such custom height maps. Finally, the larger network errors in the mannequin doll head's eye, mouth and nose regions may indicate that landscapes with larger Np values are required in the training set in order to measure such high-frequency local height variations. Depending on the general objective of the profilometry setup, setting a representative range of Np values and tweaking the ratios between them may be crucial to improving network performance; however, more research is needed to confirm this.

4.2 Evaluating the network

The (R)MSE loss is the default cost function for training regression problems [23]. Here, its use is justified both empirically (the model converges reasonably quickly) and intuitively: in contrast to super-resolution applications, where metrics corresponding more closely to human visual perception are used [24], the Euclidean or L2 distance is a more representative metric for height precision than visual appeal.

An exhaustive grid search was conducted to optimize the hyperparameters using the GridSearchCV class of the scikit-learn API [25]. Network properties such as filter size, number of channels, number of layers, learning rate and batch size were adapted dynamically whilst monitoring network convergence. Configurations were trained for up to 10 epochs, after which the most promising candidates were trained further until convergence was reached. The final configuration described in this paper is the one that produced the lowest validation error on our data set, though it should be noted that a global optimum is not guaranteed. Indeed, different network architectures may reduce the error even further, and it will be the topic of future work to examine whether residual nets [26], generative adversarial networks [27] or networks including deconvolutional layers [28] result in better performance.
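Schematically, such a search sweeps a small grid of candidate configurations and short-trains each; the parameter values and the train_candidate helper below are illustrative placeholders, not the grid actually used.

```python
from itertools import product

grid = {"filters": [32, 64], "kernel": [3, 5], "lr": [1e-3, 1e-4], "batch": [16, 32]}
val_rmse = {}
for combo in product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    # Hypothetical helper: builds a model, trains it for up to 10 epochs
    # and returns the validation RMSE.
    val_rmse[combo] = train_candidate(**params)
best = min(val_rmse, key=val_rmse.get)     # candidates to train until convergence
```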

Generally, deep neural networks are known to be highly robust to noise [29]. Our tests demonstrate the network's robustness to noisy input images up to a level of noise representative of real structured light profilometry setups (Gaussian noise with σ = 0.01). Nevertheless, it may be interesting to include a representative set of input images with varying levels of noise in the training data set to further increase robustness. In addition, convolutional neural networks have been used to inter- and extrapolate across image regions where limited or no valid information is available [30] and to automatically separate foreground objects from the background [31]. These features may be useful in structured light profilometry applications and can be implemented directly as part of the network's mapping function by modifying the training data set.

4.3 Fringe planes

In structured light profilometry setups that project periodically modulated intensity patterns, the retrieved phase maps are typically wrapped to the finite interval [-π, π] or [0, 2π], corresponding to the principal value domain of the employed demodulation function. This causes the resulting phase map to suffer from periodic artifacts, known as phase jumps, every time a fringe plane distance is crossed. Numerous phase unwrapping algorithms have been developed to reconstruct the true, unwrapped phase by adding an integer multiple of 2π to each point of the wrapped grid. Unfortunately, phase unwrapping algorithms are generally computationally expensive, and their performance is severely hindered by the presence of noise, phase vortices or regions of invalid data in the phase map. By training the network to link deformed fringe patterns directly to their respective unwrapped phase maps, the need for phase unwrapping is effectively bypassed. It should be noted, though, that this holds only as long as the total object height does not exceed T / tan(α), with T the period of the projected fringe modulation function and α the angle between the projection and observation axes. Beyond this distance, modulated fringe patterns can no longer be distinguished from their counterparts at a distance T / tan(α) higher or lower. This effectively changes the orientation of the fringe planes from parallel to the projection axis, as in classic structured light profilometry, to perpendicular to the z-height sensitivity vector.
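A worked example with the parameters of Section 2.1 confirms that the simulated surfaces stay within this unambiguous range:

```python
import numpy as np

T = 128 / 6                                # fringe period: 6 periods per 128 px
z_limit = T / np.tan(np.radians(30.0))     # ~36.9 px
# Training heights were confined to z in [0, 32] px, below this limit.
```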

4.4 Implementation in real optical profilometry setup

A natural next step is to include real surface measurements in the training data set. This can readily be achieved by gathering 3D height maps with a standard n-step phase shifting profilometry setup and storing a single input fringe pattern per measurement together with its unwrapped phase map as a data couple. When prior knowledge about the type of measurement target is available, the data set can be tuned to perform better on certain classes of surface maps simply by including more variations of them during construction of the training data set. As a very large number of images needs to be recorded, this is a time-intensive process that is beyond the scope of the present work. Importantly, although acquiring the training images may take considerable time, network inference remains extremely fast once training is complete, so a trained topography setup would operate with the same unprecedented speed as in the current simulations.

Finally, it should be noted that implementation in a real optical profilometry setup will introduce a trade-off between maximum measurement range and height prediction accuracy when selecting α. Reducing the angle between the observation and projection axes increases the maximum range within which unique deformed fringe patterns can be gathered, but decreases the accuracy of the employed n-step phase shifting technique by tilting the sensitivity vector. Since the phase maps obtained with the standard phase shifting profilometry algorithm are used to train the network, the accuracy of the network becomes inevitably linked to that of the PSP technique.

5. Conclusion

The construction of a training data set and the architecture of a convolutional neural network designed to extract height information from single input fringe patterns were described. The network measures height maps accurately, with an RMS error per pixel of < 0.7% of the maximum height range within the object, on a validation set of randomly generated height maps. In addition, it was demonstrated that the network performs equally well on fringe patterns modulated from custom height maps created entirely independently of the surface map generator. This introduces a new class of deep-learning-based approaches to the family of single-shot structured light profilometry techniques.

Funding

Fonds Wetenschappelijk Onderzoek (FWO).

Acknowledgments

The Titan X Pascal GPU used for this research was donated by the NVIDIA Corporation.

References

1. J. Salvi, S. Fernandez, T. Pribanic, and X. Llado, “A state of the art in structured light patterns for surface profilometry,” Pattern Recognit. 43(8), 2666–2680 (2010).

2. S. Van der Jeught and J. J. J. Dirckx, “Real-time structured light profilometry: a review,” Opt. Lasers Eng. 87, 1–14 (2016).

3. N. Karpinsky, M. Hoke, V. Chen, and S. Zhang, “High-resolution, real-time three-dimensional shape measurement on graphics processing unit,” Opt. Eng. 53(2), 024105 (2014).

4. Z. Zhang, C. E. Towers, and D. P. Towers, “Time efficient color fringe projection system for 3D shape and color using optimum 3-frequency selection,” Opt. Express 14(14), 6444–6455 (2006).

5. J. Geng, “Rainbow three-dimensional camera: new concept of high-speed three-dimensional vision systems,” Opt. Eng. 35(2), 376–383 (1996).

6. Y. Hu, J. Xi, E. Li, J. Chicharo, Z. Yang, and Y. Yu, “A calibration approach for decoupling colour cross-talk using nonlinear blind signal separation network,” in Conference on Optoelectronic and Microelectronic Materials and Devices, Proceedings, COMMAD (2005), pp. 265–268.

7. W. Liu, Z. Wang, G. Mu, and Z. Fang, “Color-coded projection grating method for shape measurement with a single exposure,” Appl. Opt. 39(20), 3504–3508 (2000).

8. P. M. Griffin, L. S. Narasimhan, and S. R. Yee, “Generation of uniquely encoded light patterns for range data acquisition,” Pattern Recognit. 25(6), 609–616 (1992).

9. X. Su and W. Chen, “Fourier transform profilometry,” Opt. Lasers Eng. 35(5), 263–284 (2001).

10. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25 (2012), pp. 1097–1105.

11. L. J. Lancashire, C. Lemetre, and G. R. Ball, “An introduction to artificial neural networks in bioinformatics--application to complex microarray and mass spectrometry datasets in cancer studies,” Brief. Bioinform. 10(3), 315–329 (2009).

12. A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2013), pp. 6645–6649.

13. R. Collobert and J. Weston, “A unified architecture for natural language processing,” in Proceedings of the 25th International Conference on Machine Learning (ACM Press, 2008), pp. 160–167.

14. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature 529(7587), 484–489 (2016).

15. W. Zhou, Y. Song, X. Qu, Z. Li, and A. He, “Fourier transform profilometry based on convolution neural network,” in Optical Metrology and Inspection for Industrial Applications V, S. Han, T. Yoshizawa, and S. Zhang, eds. (SPIE, 2018), 10819, p. 62.

16. Y. H. Chan and D. P. K. Lun, “Deep learning based period order detection in fringe projection profilometry,” in Proceedings, APSIPA Annual Summit and Conference 2018, 108–113 (2018).

17. S. Feng, Q. Chen, G. Gu, T. Tao, L. Zhang, Y. Hu, W. Yin, and C. Zuo, “Fringe pattern analysis using deep learning,” Adv. Photonics 1(2), 1–7 (2018).

18. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3431–3440.

19. M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, “A massively parallel coprocessor for convolutional neural networks,” in 2009 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors (IEEE, 2009), pp. 53–60.

20. M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (2016).

21. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

22. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (2010).

23. A. Botchkarev, “Performance metrics (error measures) in machine learning regression, forecasting and prognostics: properties and typology,” arXiv preprint (2018).

24. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Lecture Notes in Computer Science (2016), 9906 LNCS, pp. 694–711.

25. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn: Machine learning in Python,” J. Mach. Learn. Res. 12, 2825–2830 (2011).

26. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.

27. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” in Advances in Neural Information Processing Systems 27, 2672–2680 (2014).

28. H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 1520–1528.

29. D. Rolnick, A. Veit, S. Belongie, and N. Shavit, “Deep learning is robust to massive label noise,” arXiv preprint (2017).

30. P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger, “Deep feature interpolation for image content changes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017).

31. J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, and F. Wu, “Background prior-based salient object detection via deep reconstruction residual,” IEEE Trans. Circ. Syst. Video Tech. 25(8), 1309–1321 (2015).
