
End-to-end sensor and neural network design using differential ray tracing


Abstract

In this paper we propose a new method to jointly design a sensor and its neural-network-based processing. Using a differential ray tracing (DRT) model, we simulate the sensor point-spread function (PSF) and its partial derivative with respect to any of the sensor lens parameters. The proposed ray tracing model makes neither thin-lens nor paraxial approximations and is valid for any field of view and point source position. Using the gradient backpropagation framework for neural network optimization, any of the lens parameters can then be jointly optimized along with the neural network parameters. We validate our method for image restoration applications using three proofs of concept in which the focus setting of a given sensor is optimized. We provide interpretations of the joint optical and processing optimization results obtained with the proposed method in these simple cases. Our method paves the way to end-to-end design of a neural network and lens using the complete set of optical parameters within the full sensor field of view.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

The increasing interest in the field of computational imaging has naturally led to the question of the joint design of sensor and processing. In the literature, most co-design approaches are based on the definition of a theoretical performance model that takes into account both sensor and processing parameters [1–8]. However, in most of these papers only a restricted number of optical parameters are optimized: the aperture shape for depth from defocus [1], the flutter shutter code for motion blur removal [4], or the parameters of a cubic or binary annular phase mask for depth of field extension (EDOF) [3,5–7]. Only a few papers consider the whole set of lens optical parameters, which requires interacting with an optical design software [2,8–10].

Recent works in co-design have demonstrated the potential of the end-to-end design of a lens with a neural network [11–21]. The generic idea is to model the sensor with a convolutional layer within the neural network framework. The kernel of this layer corresponds to the sensor PSF simulated using a parametric optical model. This so-called "sensor layer" encodes the input ideal image to model the image deformations due to the sensor. A conventional neural network then follows the sensor layer to process the deformed image for a given task. If the optical model is differentiable with respect to the optical parameters, the efficient optimization tools of the neural network framework can be used to jointly optimize the optical and the processing parameters [11–21]. Yet, to model the PSF of the sensor layer, state-of-the-art papers use Fourier optics, which is indeed a differentiable model. However, this model assumes paraxial rays and a thin lens, which restricts the field of application.

In this paper, we propose a new method for end-to-end design of a sensor and a neural network, illustrated in Fig. 1. As in state-of-the-art papers [11–21], the sensor is modeled with a convolutional layer within the neural network framework; however, we propose to use an optical model based on differential ray tracing (DRT), relying on the complete set of real lens parameters. In contrast with the literature, this model relies neither on the thin-lens approximation nor on paraxial rays. It can then be used for the joint optimization of any set of real lens parameters together with the neural network parameters, for any field of view. In this paper, we validate the proposed approach with three examples of image restoration. In particular, to ease the interpretation of the optimization results, we restrict ourselves to the optimization of a single optical parameter, here the focus setting of a given optical system.


Fig. 1. Principle of the proposed end-to-end design method: a differential ray tracing module provides the sensor PSF along with its partial derivative with respect to any optical parameter. These outputs are used to define a sensor layer (in red) that encodes a sharp scene according to the image formation model. The image is then processed by the neural network layers (green), image restoration in this example, and compared to a ground truth image with the loss function $L$. The backpropagation framework of the neural network can then be run through the network and sensor layers to jointly optimize the whole set of imaging and processing parameters.


Note that we became aware of the related work of Sun et al. [22], published during the review process of this paper, which also proposes the use of differential ray tracing for end-to-end sensor optimization. We discuss our contributions with respect to this parallel work in Section 3.

2. State of the art

2.1 End-to-end design with an analytical performance model

The joint design of sensor and processing was proposed by Dowski and Cathey [23] in their pioneering work on depth of field extension (EDOF). A cubic phase plate is used to reduce the variation of the PSF with depth, and a global deconvolution provides a sharp image with an enlarged depth of field. This concept has been further explored in Refs. [3,5–7], where the phase mask, either a cubic phase plate or an annular phase mask, is optimized based on an image quality criterion derived for a generalized Wiener deconvolution filter and an optical model based on Fourier optics. For depth from defocus, a coded aperture has been proposed to reinforce the PSF variation with depth [1]. The coded aperture is optimized by maximizing the Kullback-Leibler distance between the likelihoods of the potential depths, evaluated using geometrical optics to model the PSF. In the field of motion blur removal, a flutter shutter code is optimized using a performance model derived from the signal-to-noise ratio of the image after deconvolution [4]. All these papers optimize only a single optical element of the imaging system, such as a phase mask, shutter code or aperture shape. In contrast, Stork and Robinson include a restoration criterion, namely the RMSE, within an optical design software optimization loop to conduct joint design of the full set of real lens parameters and processing parameters [2]. To impose PSF invariance, Burcklen et al. [10] define a surrogate design criterion taken from the co-design of phase masks for EDOF. This criterion is then used to optimize image quality after deconvolution within the field of view of a complete imaging system. The proposed surrogate criterion is defined so that it can be implemented directly within the merit function of the optical design software. In the field of depth estimation, Trouvé et al. [8,9] have designed a camera dedicated to depth from defocus using two performance models: one for the depth estimation performance, based on the Cramér-Rao lower bound, and an image quality model associated with a restoration based on high-frequency transfer. The optimization of the camera follows two steps: first a system specification using simple geometrical optics, and then a finer optimization using an optical design software.

2.2 End-to-end design using neural networks

Advances in deep learning have recently paved the way for joint lens and neural network optimization. Various applications have been investigated, such as EDOF [11,12,16], high dynamic range imaging [19], depth estimation using unconventional optics [13,15,17,18] and lensless imaging [20]. The key is to model the imaging process with a convolutional layer within the neural network framework. The kernel of this layer corresponds to the sensor PSF given by a parametric optical model. This sensor layer encodes the input ideal image to model the image deformation due to the sensor, i.e. the blur due to the PSF and the acquisition noise. A conventional neural network then follows the sensor layer to process this image for a given task. Given this structure, the optics and the processing parameters can be jointly optimized using the efficient gradient descent tools of the neural network framework. The main challenge is to provide, within the optimization framework, the gradient of the loss function with respect to the optical parameters. As in the first works on co-design, the optical components to be optimized are phase masks [11–13,16,17], a coded aperture [20], but also a freeform lens [15]. In most of these papers, the optical model is derived from Fourier optics. This model, which involves the Fourier transform of the complex amplitude function within the optical exit pupil plane, is differentiable with respect to the optical parameters that define the phase function. However, it assumes paraxial rays and thin lenses and considers only the optimization of a single optical element, such as an aperture shape or the phase function of a single lens, without considering the full set of real lens parameters. Very recently, Tseng et al. [24] proposed the optimization of a complex lens along with a neural network. To conduct end-to-end optimization of both processing and optical parameters, a meta-optics network is trained to accurately reproduce the behavior of complex lenses as a function of the optical parameters.

2.3 Differential ray tracing

Fourier optics relies on the thin-lens approximation: optical surfaces are modeled as a phase function that depends only on the position in the tangent plane. However, most optical elements are thick, and consequently the phase also depends on the local inclination of the wavefront. The difference is significant and explains why optical systems are almost always designed using geometrical ray tracing, which allows an aberrated PSF to be computed accurately in the presence of thick elements. For end-to-end design using neural networks, the PSF must be differentiable with respect to the optical parameters. This can be achieved with differential ray tracing [25], which was recently extended to freeform optical surfaces [26]; we call this approach generalized differential ray tracing.

3. Paper contributions and organization

In this paper, we propose to use a model based on generalized differential ray tracing to co-design a lens along with its neural network processing. This model, defined in Section 4, improves on the state of the art [26]. Using ray tracing, it provides the sensor PSF as well as its partial derivative with respect to the full set of optical parameters. We show in Section 5 how to combine this model with a neural network and conduct an end-to-end design. In Section 6 we present three proofs of concept of the optimization of a complex system using the proposed approach, for specific image restoration applications. We restrict ourselves to the case of a single-parameter optimization, the focus setting, to facilitate the interpretation of the optimization results.

Note that the parallel work of Sun et al. [22], published during the review process of this paper, also proposes to use a differential ray tracing model for the end-to-end design of a complex lens and a neural network. Experimental validations of their approach for depth of field extension and large field of view imaging are conducted. In contrast, we present here a detailed explanation of the mathematical principles and the implementation of such a model. Additionally, the simple examples presented here allow a better understanding and analysis of the interaction between the lens and the neural network during the optimization.

4. Differential ray tracing

In this section we briefly recall the principle of generalized differential ray tracing before describing the new implementation that we propose here. Finally, we describe how this model is used to simulate a PSF and its derivative with respect to the optical parameters.

4.1 Principle

Generalized differential ray tracing uses Fermat's path principle to compute the derivatives of ray intersections with the optical surfaces of a lens. We uniquely define a ray by its origin in the object plane and its intersection with a pupil plane. We denote by $\mathbf {P_i} \in \mathbb {R}^3$ the ray intersection with surface $i$, and optical surfaces are defined as functions that associate a point $\mathbf {P_i} \in \mathbb {R}^3$ to any $\mathbf {w_i} \in \mathbb {R}^2$. For example, a spherical surface of curvature $c$, tangent to the $Z=0$ plane at the point $(0,0,0)$, is defined by Eq. (1).

$$\begin{aligned} \mathbf{w_{\mbox{sph}}} & = (u, v)\\ \mathbf{P_{\mbox{sph}}} & = \begin{pmatrix} u \\ v \\ \frac{c(u^2 + v^2)}{1+\sqrt{1 - c^2(u^2 + v^2)}} \end{pmatrix} \end{aligned}$$
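As an illustration, the following minimal Julia sketch (the helper name sphere_point is ours, not part of the toolbox described below) evaluates this spherical-surface parameterization and obtains the sensitivity of the sag to the curvature $c$ with ForwardDiff.jl, the automatic differentiation package used in Section 4.2.

```julia
using ForwardDiff

# Point P_sph on a spherical surface of curvature c, tangent to the Z = 0 plane
# at the origin, parameterized by w = (u, v) as in Eq. (1).
function sphere_point(w::AbstractVector, c::Real)
    u, v = w
    r2  = u^2 + v^2
    sag = c * r2 / (1 + sqrt(1 - c^2 * r2))   # Z coordinate of the surface
    return [u, v, sag]
end

# Example: sensitivity of the sag to the curvature at (u, v) = (1.0, 2.0) mm.
w = [1.0, 2.0]
c = 1 / 50.0                                   # curvature in mm^-1
dsag_dc = ForwardDiff.derivative(c -> sphere_point(w, c)[3], c)
```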

Let OP be the optical path through a lens composed of $N$ surfaces, and let $n_{i, i+1}$ be the refractive index of the medium between surface $i$ and surface $i+1$:

$$\textrm{OP} = \sum_{i=1}^{N-1} n_{i, i+1} \lVert \mathbf{P_i} \mathbf{P_{i+1}} \rVert.$$

Since Fermat’s path principle states that rays travel along a stationary path through a lens, it follows that:

$$\frac{\partial \textrm{OP}}{\partial \mathbf{w_i}} = \mathbf{0} \quad \forall i \in \left[ 2, N-1 \right].$$

We define the Fermat error function $\mathbf {F}$ as the left-hand side of Eq. (3). The Fermat error function is valued in $\mathbb {R}^{2N-2}$ and equal to $\mathbf {0}$ for each ray. It is differentiable with respect to the optical parameters of the system (which we denote $\mathbf {\theta _{opt}}$ and which could be the curvature of a lens, a conic constant, the position of the detector, etc.) and with respect to $\mathbf {w_i}$.

We draw a distinction between the total derivative operator $d \cdot / d \cdot$ and the partial derivative operator $\partial \cdot / \partial \cdot$: the latter assumes that all other variables are kept constant, ignoring cross-dependencies between variables. By application of the chain rule we derive:

$$\frac{d \mathbf{F}}{d\mathbf{\theta_{opt}}} = \frac{\partial \mathbf{F}}{\partial\mathbf{\theta_{opt}}} + \frac{\partial \mathbf{F}}{\partial \mathbf{w_i}} \frac{\partial \mathbf{w_i}}{\partial\mathbf{\theta_{opt}}}.$$

The only unknown in Eq. (4) is $\partial \mathbf {w_i} / \partial \mathbf {\theta _{opt}}$. Indeed, the Fermat error function depends on $\mathbf {\theta _{opt}}$ and $\mathbf {w_i}$ only through the functions describing the optical surfaces, which are known and can be differentiated.

Since $\mathbf {F}$ remains identically zero along physical rays, $d\mathbf {F}/d\mathbf {\theta _{opt}} = \mathbf {0}$, and Eq. (4) therefore becomes a linear system whose solution yields $\partial \mathbf {w_i} / \partial \mathbf {\theta _{opt}}$. This process can also be seen as an application of the implicit function theorem; further details on the derivation can be found in our previous work [26].
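The following Julia sketch shows how this linear solve could be carried out for a generic Fermat error function, using ForwardDiff.jl (introduced in Section 4.2) for the two Jacobians of Eq. (4); fermat_error is a placeholder for a user-supplied function, not the toolbox implementation.

```julia
using ForwardDiff, LinearAlgebra

# Sensitivity of the ray coordinates w to the optical parameters θ_opt.
# Since F stays equal to zero along physical rays, dF/dθ_opt = 0 in Eq. (4), so
#     (∂F/∂w) (∂w/∂θ_opt) = -(∂F/∂θ_opt).
# `fermat_error(w, θ)` is any Fermat error function F (Eq. (3)), built by
# stacking the derivatives ∂OP/∂w_i of Eq. (2) for the intermediate surfaces.
function ray_sensitivity(fermat_error, w::AbstractVector, θ::AbstractVector)
    Jw = ForwardDiff.jacobian(w -> fermat_error(w, θ), w)   # ∂F/∂w
    Jθ = ForwardDiff.jacobian(θ -> fermat_error(w, θ), θ)   # ∂F/∂θ_opt
    return -(Jw \ Jθ)                                       # ∂w/∂θ_opt
end
```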

4.2 Implementation

Previous work relied on Theano [27] to compute $\frac {\partial \mathbf {F}}{\partial \mathbf {\theta _{opt}}}$ and $\frac {\partial \mathbf {w_i}}{\partial \mathbf {\theta _{opt}}}$, and to propagate the ray tracing derivatives to the metric of interest.

In this work, our implementation relies instead on ForwardDiff.jl [28], which implements forward-mode automatic differentiation. This mode is better suited than others when the number of computed outputs (ray tracing results) exceeds the number of inputs (optical parameters).

In ForwardDiff.jl, forward-mode differentiation is implemented using dual numbers. A dual number is a pair consisting of a value and its derivative with respect to an input. At each operation, ForwardDiff.jl updates the value and the derivative simultaneously. For example, to compute the derivative of $\sin (x^2)$ with respect to $x$ at $x=1$, ForwardDiff.jl proceeds as in Table 1.


Table 1. Example of a forward mode differentiation calculation

For a trivial example such as $\sin (x^2)$ the advantage is not obvious, but for a complicated function such as the Fermat error function, one notes that each step depends only on the result of the previous step. This allows the derivative calculation to be performed side by side with the primal calculation.
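In code, the computation of Table 1 can be obtained as in the short sketch below (illustrative only, not the toolbox code); the value and the derivative are carried forward together through each elementary operation.

```julia
using ForwardDiff

f(x) = sin(x^2)

# Value and derivative at x = 1, propagated jointly as in Table 1.
x   = 1.0
val = f(x)                           # sin(1)      ≈ 0.8415
der = ForwardDiff.derivative(f, x)   # 2x cos(x^2) ≈ 1.0806
```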

ForwardDiff.jl implements differentiation rules for a large number of elementary operations, so any calculation expressed in terms of these elementary operations can be differentiated by ForwardDiff.jl. This greatly simplifies the implementation of a derivative calculation: since the chain rule is applied automatically, there is no need for cumbersome differentiation of large mathematical expressions.

In practical terms, ForwardDiff.jl is implemented in Julia, a language that strikes a good compromise between ease of use and performance. At the core of Julia is the concept of multiple dispatch: a function (here we employ the term in a programming context) can be defined multiple times for different combinations of argument types. For example, the $\sin$ function can have two definitions, one for real-number arguments and one for dual-number arguments; the second definition performs the calculation described in row 3 of Table 1.
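To make the multiple dispatch mechanism concrete, here is a toy dual-number type with its own methods for multiplication and sin; this mirrors the idea only and is not ForwardDiff.jl's actual internal definition.

```julia
# Toy dual number: a value and its derivative with respect to the input.
struct MyDual
    val::Float64
    der::Float64
end

# Second methods of * and sin, dispatched on MyDual (product and chain rules).
Base.:*(a::MyDual, b::MyDual) = MyDual(a.val * b.val, a.val * b.der + a.der * b.val)
Base.sin(d::MyDual)           = MyDual(sin(d.val), cos(d.val) * d.der)

x = MyDual(1.0, 1.0)   # seed the derivative dx/dx = 1
y = sin(x * x)         # MyDual(0.8415, 1.0806), matching Table 1 (x * x plays the role of x^2)
```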

It is also possible for a Julia program to add a new differentiation rule to ForwardDiff.jl. DRT uses this mechanism and adds a rule to differentiate the raytrace function.

4.3 Practical use

We have developed a specific toolbox for optical system simulation based on the proposed differential ray tracing model. As illustrated in Fig. 2, the proposed model takes as input the usual parameters describing a real optical system, denoted $\mathbf {\theta _{opt}}$, i.e. for each lens the material, the radius of curvature, the conic constant, the width and the relative position, as well as the pupil position and size and the sensor position. It also takes as input the wavelength and the point source position with respect to the first lens, denoted $\alpha$ and $z$, corresponding respectively to the field angle with respect to the optical axis and the point source depth along this axis. The number of rays $N_{\mathbf {rays}}$ is also a parameter of the model. At the output, the model provides the covariance matrix $\Sigma ^{\alpha ,z}_{\mathbf {\theta _{opt}}}$ and the centroid $\mu ^{\alpha ,z}_{\mathbf {\theta _{opt}}}$ of the rays reaching the sensor, as well as their Jacobians with respect to the optical parameters. The covariance matrix and the centroid are computed by tracing a sample of $N_{\mathbf {rays}}$ rays through the optical system and by computing $\Sigma ^{\alpha ,z}_{\mathbf {\theta _{opt}}}$ and $\mu ^{\alpha ,z}_{\mathbf {\theta _{opt}}}$ from the resulting sample of ray-detector intersections. Since we supply a rule to differentiate the raytrace, and since the centroid and covariance calculations are expressed in terms of elementary operations that ForwardDiff.jl can differentiate out of the box, we directly obtain $\Sigma ^{\alpha ,z}_{\mathbf {\theta _{opt}}}$ and $\mu ^{\alpha ,z}_{\mathbf {\theta _{opt}}}$ and their Jacobians with respect to the optical parameters.
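The sketch below illustrates how the spot-diagram moments and their Jacobians could be obtained in this way; toy_trace is a deliberately simplistic stand-in for the actual raytrace function of the toolbox, whose only purpose is to make the example runnable.

```julia
using Statistics, ForwardDiff

# Moments of the spot diagram: centroid μ and covariance Σ of the ray-detector
# intersections. `pts` is an N_rays × 2 matrix of (x, y) hits on the sensor.
spot_centroid(pts)   = vec(mean(pts, dims = 1))
spot_covariance(pts) = cov(pts)                  # 2 × 2 sample covariance

# Toy stand-in for the raytrace: 256 rays sampled on the pupil land on the
# sensor with a spread proportional to a single "defocus" parameter θ[1].
pupil = randn(256, 2)
toy_trace(θ) = θ[1] .* pupil

# Jacobians of μ and vec(Σ) with respect to θ_opt, obtained by differentiating
# straight through the trace and the moment computations.
θ_opt = [0.05]
dμ_dθ = ForwardDiff.jacobian(θ -> spot_centroid(toy_trace(θ)), θ_opt)
dΣ_dθ = ForwardDiff.jacobian(θ -> vec(spot_covariance(toy_trace(θ))), θ_opt)
```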


Fig. 2. Illustration of the use of the proposed DRT model. From a set of optical parameters and an object position (field angle $\alpha$ and depth $z$), the differential ray tracing model provides the covariance matrix $\Sigma _{\mathbf {\theta _{opt}}}^{\alpha ,z}$ and the centroid $\boldsymbol {\mu }_{\mathbf {\theta _{opt}}}^{\alpha ,z}$ of the rays reaching the sensor. A Gaussian model is then used to model the PSF $h$ and its gradients with respect to the optical parameters.


The PSF $h$ is modeled as a 2D Gaussian according to:

$$h_{\mathbf{\theta_{opt}}}^{\alpha,z} [\textbf{u}]=\frac{1}{2\pi |\Sigma_{\mathbf{\theta_{opt}}}^{\alpha,z}|^{1/2}} \exp\left(-\frac{(\textbf{u}-\boldsymbol{\mu}_{\mathbf{\theta_{opt}}}^{\alpha,z})^t \left(\Sigma_{\mathbf{\theta_{opt}}}^{\alpha,z}\right)^{-1} (\textbf{u}-\boldsymbol{\mu}_{\mathbf{\theta_{opt}}}^{\alpha,z})}{2}\right),$$
where $\textbf {u}$ corresponds to the 2D pixel coordinates and $|\cdot|$ stands for the matrix determinant. In addition, we convert the PSF covariance matrix and centroid from mm$^2$ and mm to pixel units according to the sensor pixel size. Note that we neglect here the smoothing effect of pixel integration.
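A minimal sketch of this PSF model is given below (our own helper, not the toolbox code); the only departure from Eq. (5) is that the discretized kernel is renormalized to sum to one, which is an implementation choice of this sketch.

```julia
using LinearAlgebra

# Gaussian PSF of Eq. (5) sampled on a k × k pixel grid centred on the kernel.
# μ and Σ are the ray centroid and covariance expressed in pixel units.
function gaussian_psf(μ::AbstractVector, Σ::AbstractMatrix, k::Int)
    Σinv = inv(Σ)
    amp  = 1 / (2π * sqrt(det(Σ)))
    c    = (k + 1) / 2                          # grid centre
    h = [begin
             d = [i - c, j - c] .- μ
             amp * exp(-0.5 * dot(d, Σinv * d))
         end for i in 1:k, j in 1:k]
    return h ./ sum(h)                          # renormalise the discretised kernel
end

# Example: 21 × 21 defocus PSF with blur standard deviation σ = 2 pixels.
σ = 2.0
h = gaussian_psf([0.0, 0.0], Matrix(σ^2 * I, 2, 2), 21)
```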

5. Sensor layer

In this section we describe the sensor layer that encodes the image formation model. We describe separately the forward and backward models within the neural network framework.

5.1 Forward model

Given the sensor PSF $h_{\mathbf {\theta _{opt}}}^{\alpha ,z}$ associated with a point source placed at field angle $\alpha$ and depth $z$, the classical image formation model at pixel $\textbf {u}$ reads:

$$y[\textbf{u}]=(h_{\mathbf{\theta_{opt}}}^{\alpha,z}*x) [\textbf{u}]+n[\textbf{u}],$$
where $n$ stands for additive noise, usually modeled as white Gaussian noise of standard deviation $\sigma _n$. Eq. (6) is only valid in a region where the PSF is spatially invariant, which amounts to locally neglecting the variation of the PSF with field angle and assuming a constant depth. Hence, in the following we consider only image patches of small dimension for which this assumption is valid.

Given the convolutional relation between the acquired image and the sharp ideal scene, the sensor can be seen as an encoding of the sharp scene and modeled with a convolutional layer of a neural network. The kernel of this layer corresponds to the PSF. White Gaussian noise can also be added within this layer to model the sensor noise. Then the simulated images can be processed by the neural network for any task.
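The following Julia sketch shows one way this forward pass could be written (helper name of our own choosing, with zero padding at the patch borders as a simplifying assumption):

```julia
# Sensor-layer forward pass (Eq. (6)): blur the sharp patch x with the PSF h,
# then add white Gaussian noise of standard deviation σ_n.
# "Same-size" direct convolution with zero padding outside the patch.
function sensor_forward(x::AbstractMatrix, h::AbstractMatrix, σ_n::Real)
    H, W   = size(x)
    kh, kw = size(h)
    oh, ow = kh ÷ 2 + 1, kw ÷ 2 + 1                  # kernel centre
    y = zeros(promote_type(eltype(x), eltype(h)), H, W)
    for j in 1:W, i in 1:H
        acc = zero(eltype(y))
        for b in 1:kw, a in 1:kh
            ii, jj = i - (a - oh), j - (b - ow)      # flipped kernel: convolution
            if 1 <= ii <= H && 1 <= jj <= W
                acc += h[a, b] * x[ii, jj]
            end
        end
        y[i, j] = acc
    end
    return y .+ σ_n .* randn(H, W)                   # additive white Gaussian noise
end

# Example (assuming x is a 64 × 64 sharp patch and h the 21 × 21 PSF of Section 4.3):
# y = sensor_forward(x, h, 0.01)
```

In practice, any framework convolution layer with a fixed, non-trainable kernel would serve the same purpose; the explicit loop is only meant to make Eq. (6) concrete.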

5.2 Backward model

The aim of the backward model is to define the gradient of the loss function with respect to the optical parameters. In the following, we drop the explicit dependence of $h$ on ($z$, $\alpha$, $\mathbf {\theta _{opt}}$) for the sake of simplicity. Based on the image formation model of Eq. (6), the partial derivative of the image $y$ with respect to the optical parameters $\mathbf {\theta _{opt}}$ reads:

$$\frac{\partial y}{\partial \mathbf{\theta_{opt}}}=\frac{\partial h}{\partial \mathbf{\theta_{opt}}}*x,$$
where $\frac {\partial h}{\partial \mathbf {\theta _{opt}}}$ can be derived using the chain rule from the partial derivatives $\frac {\partial \boldsymbol{\mu }}{\partial \mathbf {\theta _{opt}}}$ and $\frac {\partial \Sigma }{\partial \mathbf {\theta _{opt}}}$ (provided by the DRT optical model) and the partial derivatives $\frac {\partial h}{\partial \boldsymbol{\mu} }$ and $\frac {\partial h}{\partial \Sigma }$ (derived from Eq. (5)). The partial derivative of the loss function $L$ with respect to the optical parameters is then directly given by:
$$\frac{\partial L}{\partial \mathbf{\theta_{opt}}}=\frac{\partial L}{\partial y}\cdot\frac{\partial y}{\partial \mathbf{\theta_{opt}}}=\frac{\partial L}{\partial y} \cdot \left(\frac{\partial h}{\partial \mathbf{\theta_{opt}}}*x\right),$$
where the first term, corresponding to the gradient of the loss $L$ with respect to the image, is provided directly by backpropagation functions within the neural network framework.
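As a hedged sketch of Eq. (8), reusing the sensor_forward helper from the sketch in Section 5.1 (the names are ours, not the paper's implementation), the contribution of one scalar optical parameter is accumulated as follows:

```julia
# Backward pass of the sensor layer for one scalar optical parameter θ (Eq. (8)):
#   ∂L/∂θ = < ∂L/∂y , (∂h/∂θ) * x >,
# where dL_dy is the upstream gradient returned by backpropagation, dh_dθ is the
# PSF derivative supplied by the DRT model, and * is the convolution of Eq. (7).
# The noise term of Eq. (6) has zero derivative, so it is simply omitted (σ_n = 0).
function sensor_backward(dL_dy::AbstractMatrix, dh_dθ::AbstractMatrix, x::AbstractMatrix)
    dy_dθ = sensor_forward(x, dh_dθ, 0.0)   # (∂h/∂θ) * x
    return sum(dL_dy .* dy_dθ)              # scalar ∂L/∂θ
end
```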

6. Applications for image restoration

As stated in Section 4, our differentiable optical model can be used for the optimization of any set of real lens parameters. Yet, for a first validation of our approach and a better understanding of the optimization results, we present in the following the results of a single-parameter optimization, namely the focus setting. Hence, we use an already optimized optical system and optimize the sensor position to change the focus.

In this section, we present three examples of the proposed method applied to image restoration. In the first two examples, our optical system is the classical Double Gauss lens illustrated in Fig. 3. As mentioned above, we consider that the only optical parameter left to optimize is the sensor position, which changes the focus setting. First, we use only the sensor layer and optimize the sensor position to restore a sharp image. This amounts to asking the lens to be focused on the object. In the second example, we add a restoration network after the sensor layer and jointly optimize the sensor position and the network parameters for the same task of sharp image restoration. In the last example, we consider the concept of depth of field extension with a lens having chromatic aberration, introduced in the literature [11,29]. To simply obtain chromatic aberration, we add a specific doublet, as proposed previously [30], in front of the Double Gauss. This doublet adds a spectral variation of the lens focal length. We then optimize the sensor position to obtain the best image quality over a given depth range. In the following we provide the generic settings used in these examples.


Fig. 3. (a) Double Gauss lens. (b) Chromatic add-on as proposed in Trouvé-Peloux et al. [30] parametrized by the radius of curvature $R$ (the lower $R$, the higher the induced chromatic aberration).


Dataset Due to the spatial variation of the PSF, we consider here only image patches for which the depth $z$ and the field angle $\alpha$ (i.e. the PSF) can be assumed constant. The input dataset is the Describable Textures Dataset [31], as proposed in the co-design literature for EDOF using a neural network [11]. It contains 47 categories with a total of 5640 images, from each of which 4 randomly cropped patches are extracted. We end up with 22 600 patches of resolution 64$\times$64 pixels, randomly split between the training and the test set with an 80%-20% ratio, in batches of 16 patches. A complete epoch is made of 1130 batches.

DRT model settings For our experiments, we choose to work with a classical Double Gauss lens of 100 mm focal length and 9 mm aperture diameter. We consider the scene to be in the range 10 to 50 m. To simplify the interpretation of the interaction between the sensor layer and the network layers, we make several assumptions. First, as already mentioned, the sensor position is the only optical parameter left for optimization. Second, we only simulate the on-axis PSF and neglect the variation of the PSF with field angle. Hence, we only use the first diagonal element of the covariance matrix, from which we obtain the standard deviation, referred to as $\sigma$, that characterizes the defocus blur size. For all the following experiments, the PSF is of size 21$\times$21 pixels and is computed using 256 rays, which we found to be a reasonable trade-off between speed and accuracy of the PSF simulation. To model a color sensor, we simulate the PSF at three RGB wavelengths of 600 nm, 530 nm and 480 nm, respectively.

Restoration network The image restoration network used in subsections 6.2 and 6.3 is the deep residual encoder-decoder network RED-Net of Mao et al. [32]. It proved simple enough (298,947 trainable parameters in our implementation) to converge quickly and to avoid vanishing gradients at our upstream parametric sensor layer.

General implementation details We use the $\mathcal {L}_1$ loss and the Adam optimizer [33], tuning only the learning rate to adjust the accuracy and speed of convergence to the increasing complexity of the networks.

6.1 Lens focus using only the sensor layer

In this simple example, the object is placed on axis, 25 m ahead of the first surface of the Double Gauss, and the sensor is voluntarily defocused to produce blurred and noisy color images. Different starting points for the sensor position are considered, $\theta _{\mathbf {opt}}= \{130, 132, 133.3, 134, 135\}$ mm, with 133.3 mm being almost in focus. In this experiment, the image at the output of the sensor layer is directly compared to the ideal sharp image (there are no neural network layers). Hence, training updates the sensor position in order to minimize the difference between the ground truth sharp scene and the synthetic image encoded by the sensor layer. Figure 4(a) shows the evolution of the sensor position during training for each of the five starting points. Figure 4(b) shows the variation of the mean standard deviation $\sigma$ of the PSF over the three RGB color channels during training. Both Figs. 4(a) and 4(b) show a quick convergence of the sensor to the same position for all starting points. In all cases, the minima are reached within the first 60 batches out of the 1130 that constitute one epoch. Every starting point then oscillates around the plane of best focus, including the one almost in focus from the very beginning. Such a fast convergence is not surprising, given that this shallow network has a single parameter to update.
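To make the mechanics of this single-parameter experiment explicit, here is a hedged sketch of one gradient step (plain gradient descent rather than Adam, for brevity). It reuses gaussian_psf and sensor_forward from the sketches of Sections 4.3 and 5.1, and defocus_sigma is a hypothetical helper standing for the DRT-provided blur standard deviation as a function of the sensor position; none of these names belong to the actual implementation.

```julia
using ForwardDiff, LinearAlgebra

# One gradient-descent step on the sensor position θ (in mm), with only the
# sensor layer: the encoded patch is compared to the sharp patch with an L1 loss,
# and the loss is differentiated with respect to θ through the PSF.
function focus_step(θ::Real, x_sharp::AbstractMatrix; lr = 0.1, σ_n = 0.01)
    loss = θ -> begin
        σ = defocus_sigma(θ)                             # hypothetical DRT-provided blur size
        h = gaussian_psf([0.0, 0.0], Matrix(σ^2 * I, 2, 2), 21)
        y = sensor_forward(x_sharp, h, σ_n)              # sensor-layer output
        sum(abs.(y .- x_sharp)) / length(x_sharp)        # L1 loss
    end
    return θ - lr * ForwardDiff.derivative(loss, θ)
end
```

Iterating such an update over the training patches is expected to drive the sensor position toward the plane of best focus, which is the behavior observed in Fig. 4(a).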


Fig. 4. Convergence during training of the sensor position (4a) and of the mean defocus blur size $\sigma$ over the three (RGB) channels (4b), for several starting points of the sensor position. Learning rate is 0.1.


This example illustrates the ability of our proposed DRT model to be used within the neural network framework.

6.2 Lens focus using the sensor layer and a restoration network

In this second example, we keep the same Double Gauss lens with an object placed 25 m in front of it, but the sensor layer is now followed by the RED-Net restoration network. Once again, we train the network with several starting points of the sensor position: $\mathbf {\theta _{opt}} = \{132.0, 132.5, 133.0, 133.5, 134.0, 134.5\}$ mm. All of them are realistically out of focus, with a Gaussian blur standard deviation not exceeding the kernel size, with the exception of the first position at 132 mm.

Figure 5(a) shows the evolution of the sensor position over 8 epochs of training. The starting points 133.0 and 133.5 mm quickly reach a stable position at the beginning of the training, whereas more defocused starting points, such as 134.5 mm, reach the same final sensor position by the sixth epoch. Broadly speaking, the more defocused the sensor, the slower it reaches an optimal position, which is not surprising given the gradient descent optimization of the neural network framework. It should be noted that the network was able to converge even for the starting point whose initial blur standard deviation, 24 pixels according to Fig. 4(b), was larger than the 21-pixel kernel of the PSF. Moreover, all the starting points reach the same sensor position around 133.35 mm, as in our first experiment where only the sensor layer was used. Hence, the joint optimization efficiently leads to a solution where the lens is correctly focused on the object. However, the convergence point is reached here after a larger number of epochs than in the first example (Fig. 4(a)). This is not surprising, as in this case around 300k parameters are to be optimized. This result shows that the strong influence of the sensor position on the defocus blur is well exploited during the optimization. We observe that the final $\mathcal {L}_1$ test loss after 8 epochs is around 0.02, compared to 0.03 in the first example where there is only the sensor layer, which shows that the neural network performs an effective restoration. The table in Fig. 5(b) reports the restoration metrics of the network on the test dataset after 8 epochs of training, for various noise levels: for a patch dynamic range between 0 and 1, a Gaussian noise of standard deviation 0.01 is visible but subtle, while 0.05 corresponds to a high noise level. Our optimization results show that variations of the noise level do not affect the optimal sensor position.


Fig. 5. Convergence to the optimized sensor position through the training (5a) and evaluation of the starting points $\{133.5, 134.0\}$ mm for several Gaussian noise levels, with $\sigma$ varying from 0.01 to 0.05 (5b). RMSE stands for root mean square error and MAE for mean absolute error. Learning rate is 0.001.


This second example illustrates that the DRT model can be used for joint optimization with complex networks. Incidentally, even optically aberrant starting points such as 132.0 mm end up being optimized, paving the way to more challenging optimization problems.

6.3 Optimization of a lens with chromatic aberration and a restoration network for EDOF

In our last example, we consider a state-of-the-art concept of depth of field extension (EDOF) using a lens with chromatic aberration [29]. The main idea is that such a lens has, for each depth, one channel that is sharper than the others. Guichard et al. [29] proposed to conduct a high-frequency transfer to restore an image with a large depth of field. For this task, Elmalem et al. [11] jointly optimize an annular phase mask and a restoration neural network using Fourier optics. Here, to model a complete lens having chromatic aberration, we simply add to the Double Gauss lens the add-on proposed in Ref. [30]. This add-on, illustrated in Fig. 3(b), is a doublet made of glasses having the same index of refraction at a given wavelength but different dispersions. Hence, the doublet has an infinite focal length at this specific wavelength, but adds a spectral variation of the focal length at the other wavelengths. The amount of chromatic aberration can be tuned directly using the radius of curvature of the doublet. Here we empirically choose a radius of curvature of 200 mm, in order to set the in-focus planes of the RGB color channels within the depth range of interest. In our experiment, the add-on parameters are fixed, and we optimize the sensor position jointly with the restoration network previously used in Section 6.2, in order to put the R, G, B in-focus planes at an optimal position for image restoration over a given depth range. To perform EDOF, each new batch is processed with an object randomly placed from 10 to 50 meters in front of the camera. This ensures a homogeneous mix of the synthetic intermediate images seen by the RED-Net restoration network within the depth range of interest. We expect the training to return the sensor position optimized for EDOF with the proposed optical system.

We jointly train the sensor position and the restoration layers with different starting points, $\mathbf {\theta _{opt}} = \{145.0, 145.25, 145.5, 145.75, 146.0\}$ mm. The starting points are chosen empirically to obtain a wide variety of positions of the RGB in-focus planes within the depth range of interest, from 10 to 50 m. Figure 6(a) shows the evolution of the sensor position along the 8 epochs of training. Compared to Figs. 4(a) and 5(a), the randomized placement of the object leads to small variations around the optimal position, but all the starting points converge within one epoch to an optimized sensor position around 145.29 mm. Figure 6(b) compares the blur standard deviation $\sigma$ given by our DRT model for the three channels (RGB) along the depth of field for two sensor positions: pale dotted lines correspond to the sensor placed at 145.5 mm and continuous lines to the optimized position at 145.3 mm. Each channel is depicted with its associated color (blue line for the blue channel, and so on), so that the chromatic aberration appears clearly. Table 2 reports the evaluation of our model on the test dataset. We compare the performance obtained from one of the starting points with a fixed sensor layer (only the restoration layers are optimized) with the joint optimization of the sensor layer and the restoration layers. For each case, we evaluate the error in the $\mathcal {L}_2$ norm (RMSE) and the $\mathcal {L}_1$ norm (MAE) for object positions from 5 to 50 m, in steps of 5 m. Visual interpretation of Fig. 6(b) and the quantitative evaluation of Table 2 show that the starting point at 145.5 mm yields smaller defocus blurs for objects between 15 and 20 m, and the best restoration performance in this range. On the other hand, the joint optimization, which leads to the sensor position at 145.29 mm, shows globally improved performance metrics over the range 10 to 50 m, as required. The R, G, B in-focus planes have been moved to obtain smaller defocus blur over the range of interest.


Fig. 6. Convergence to the optimized sensor position (6a) through the training and comparison of the defocus blur size $\sigma$ along the depth-of-field before and after training with the starting point $\mathbf {\theta _{opt}} = 145.5$ mm (6b). Learning rate is 0.001.



Table 2. Comparison of image restoration metrics with or without the optimization of the sensor layer for the starting point 145.50 mm. RMSE stands for root mean square error and MAE for mean absolute error. Values in bold denote better restoration.

We provide qualitative restoration results in Fig. 7, which shows the ground truth $x$, the intermediate imaging output $y$ and the restoration output $\hat {x}$ for different test patches over the range 10 to 50 m. These qualitative results are consistent with Table 2: the restoration is of good quality throughout the range 20 to 50 m. Blur is significantly reduced and a high level of detail is restored. As expected from the quantitative evaluation of Table 2, the object placed at 10 m shows poor restoration results, due to an excessive level of defocus blur in each color channel.


Fig. 7. Simulation and restoration of test patches. First row: ground truth test patches placed from 10 to 50 meters in front of the optics. Second row: intermediate imaging output of our trained sensor layer. Third row: restoration using the trained restoration network. The numeric value at the top-left corner of each patch is the RMSE with respect to the corresponding ground truth patch.


This experiment illustrates the potential of the proposed method for co-design of unconventional optics for computational photography applications.

7. Conclusion and perspectives

In this paper, we have presented a new method for the joint design of optics and neural network processing. The proposed method is based on a differential ray tracing optical model which provides the sensor PSF and its partial derivatives with respect to the whole set of lens parameters. These model outputs can be directly included in the neural network framework thanks to a convolutional sensor layer that encodes the sharp scene to simulate the acquired image. This layer can be followed by any neural network for processing the image. The backpropagation functions of the deep learning framework then allow the joint optimization of the optical and the network parameters. In contrast with state-of-the-art co-design approaches, our model, based on ray tracing, makes neither a paraxial nor a thin-lens assumption, and can thus be used for any object position within the sensor field of view. Using three proofs of concept, we have illustrated the ability of our method to jointly optimize a single optical parameter and a restoration processing with consistent results. These simple examples open the path to more complex optimizations using the proposed method. Indeed, in future work, more lens parameters will be optimized with the proposed method, including on- and off-axis position considerations. We expect new challenges in the optimization of more than one optical parameter, such as multiple local minima of the loss function or convergence to non-physical values of the optical parameters. We also expect an increased influence of the starting point. All these questions will be the subject of our future work. Finally, a direct PSF computation at the output of the DRT model is currently under study, to avoid the use of the Gaussian PSF model.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [31].

References

1. A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image and depth from a conventional camera with a coded aperture,” in Proceedings of the Association of Computing Machinery’s Special Interest Group on Computer Graphics and Interactive Techniques, (2007), p. 70es.

2. D. G. Stork and M. D. Robinson, “Theoretical foundations for joint digital-optical analysis of electro-optical imaging systems,” Appl. Opt. 47(10), B64–B75 (2008). [CrossRef]  

3. F. Diaz, F. Goudail, B. Loiseaux, and J.-P. Huignard, “Increase in depth of field taking into account deconvolution by optimization of pupil mask,” Opt. Lett. 34(19), 2970–2972 (2009). [CrossRef]  

4. A. Agrawal and R. Raskar, “Optimal single image capture for motion deblurring,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2009), pp. 2560–2567.

5. R. Falcón, F. Goudail, and C. Kulcsár, “How many rings for binary phase masks co-optimized for depth of field extension?” in Imaging and Applied Optics Conference, OSA Technical Digest (online), (Optical Society of America, 2016), p. CTh1D.5.

6. O. Lévêque, C. Kulcsár, A. Lee, H. Sauer, A. Aleksanyan, P. Bon, L. Cognet, and F. Goudail, “Co-designed annular binary phase masks for depth-of-field extension in single-molecule localization microscopy,” Opt. Express 28(22), 32426–32446 (2020). [CrossRef]  

7. A. Fontbonne, H. Sauer, and F. Goudail, “Theoretical and experimental analysis of co-designed binary phase masks for enhancing the depth of field of panchromatic cameras,” Opt. Eng. 60(03), 1–20 (2021). [CrossRef]  

8. P. Trouvé, F. Champagnat, G. Le Besnerais, G. Druart, and J. Idier, “Performance model of depth from defocus with an unconventional camera,” J. Opt. Soc. Am. A 38(10), 1489–1500 (2021). [CrossRef]  

9. P. Trouvé, F. Champagnat, G. Le Besnerais, G. Druart, and J. Idier, “Design of a chromatic 3d camera with an end-to-end performance model approach,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, (IEEE, 2013), pp. 953–960.

10. M.-A. Burcklen, H. Sauer, F. Diaz, and F. Goudail, “Joint digital-optical design of complex lenses using a surrogate image quality criterion adapted to commercial optical design software,” Appl. Opt. 57(30), 9005–9015 (2018). [CrossRef]  

11. S. Elmalem, R. Giryes, and E. Marom, “Learned phase coded aperture for the benefit of depth of field extension,” Opt. Express 26(12), 15316–15331 (2018). [CrossRef]  

12. V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Trans. Graph. 37(4), 1–13 (2018). [CrossRef]  

13. H. Haim, S. Elmalem, R. Giryes, A. M. Bronstein, and E. Marom, “Depth estimation from a single image using deep learned phase coded mask,” IEEE Trans. Comput. Imaging 4(3), 298–310 (2018). [CrossRef]  

14. J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Scientific Reports (2018).

15. J. Chang and G. Wetzstein, “Deep optics for monocular depth estimation and 3d object detection,” in Proceedings of the IEEE International Conference on Computer Vision, (IEEE, 2019), pp. 10193–10202.

16. U. Akpinar, E. Sahin, and A. Gotchev, “Learning optimal phase-coded aperture for depth of field extension,” in Proceedings of IEEE International Conference on Image Processing, (IEEE, 2019), pp. 4315–4319.

17. Y. Wu, V. Boominathan, H. Chen, A. Sankaranarayanan, and A. Veeraraghavan, “Phasecam3d — learning phase masks for passive single view depth estimation,” in Proceedings of IEEE International Conference on Computational Photography, (IEEE, 2019), pp. 1–12.

18. E. Nehme, D. Freedman, R. Gordon, B. Ferdman, L. E. Weiss, O. Alalouf, T. Naor, R. Orange, T. Michaeli, and Y. Shechtman, “Deepstorm3d: dense 3d localization microscopy and psf design by deep learning,” Nat. Methods 17(7), 734–740 (2020). [CrossRef]  

19. C. Metzler, H. Ikoma, Y. Peng, and G. Wetzstein, “Deep optics for single-shot high-dynamic-range imaging,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2020), pp. 1375–1385.

20. R. Horisaki, Y. Okamoto, and J. Tanida, “Deeply coded aperture for lensless imaging,” Opt. Lett. 45(11), 3131–3134 (2020). [CrossRef]  

21. H. Ikoma, C. M. Nguyen, C. A. Metzler, Y. Peng, and G. Wetzstein, “Depth from defocus with learned optics for imaging and occlusion-aware depth estimation,” in Proceedings of IEEE International Conference on Computational Photography, (IEEE, 2021).

22. Q. Sun, C. Wang, Q. Fu, X. Dun, and W. Heidrich, “End-to-end complex lens design with differentiate ray tracing,” ACM Trans. Graph. 40(4), 1–13 (2021). [CrossRef]  

23. E. R. Dowski and W. T. Cathey, “Extended depth of field through wave-front coding,” Appl. Opt. 34(11), 1859–1866 (1995). [CrossRef]  

24. E. Tseng, A. Mosleh, F. Mannan, K. St-Arnaud, A. Sharma, Y. Peng, A. Braun, D. Nowrouzezahrai, J.-F. Lalonde, and F. Heide, “Differentiable compound optics and processing pipeline optimization for end-to-end camera design,” ACM Trans. Graph. 40(2), 1–19 (2021). [CrossRef]  

25. D. P. Feder, “Differentiation of ray-tracing equations with respect to construction parameters of rotationally symmetric optics,” J. Opt. Soc. Am. 58(11), 1494–1505 (1968). [CrossRef]  

26. J.-B. Volatier, Á. Menduiña-Fernández, and M. Erhard, “Generalization of differential ray tracing by automatic differentiation of computational graphs,” J. Opt. Soc. Am. A 34(7), 1146–1151 (2017). [CrossRef]  

27. R. Al-Rfou, G. Alain, A. Almahairi, et al. (The Theano Development Team), “Theano: A Python framework for fast computation of mathematical expressions,” arXiv:1605.02688 (2016).

28. J. Revels, M. Lubin, and T. Papamarkou, “Forward-mode automatic differentiation in Julia,” arXiv:1607.07892 [cs.MS] (2016).

29. F. Guichard, H. Nguyen Phi, R. Tessières, M. Pyanet, I. Tarchouna, and F. Cao, “Extended depth-of-field using sharpness transport across color channels,” Proc. SPIE 7250, 72500N (2009). [CrossRef]  

30. P. Trouvé-Peloux, J. Sabater, A. Bernard-Brunel, F. Champagnat, G. L. Besnerais, and T. Avignon, “Turning a conventional camera into a 3d camera with an add-on,” Appl. Opt. 57(10), 2553–2563 (2018). [CrossRef]  

31. M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2014), pp. 3606–3613.

32. X. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” Advances in neural information processing systems 29, 2802–2810 (2016).

33. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of International Conference on Learning Representations, Y. Bengio and Y. LeCun, eds. (2015), pp. 1–13.
