
Designing freeform imaging systems based on reinforcement learning

Open Access

Abstract

The design of complex freeform imaging systems with advanced system specifications is often a tedious task that requires extensive human effort. In addition, the lack of design experience and expertise, which results from the complex and uncertain nature of freeform optics as well as its limited history of use, further contributes to the design difficulty. In this paper, we propose a design framework for freeform imaging systems using reinforcement learning. A trial-and-error method employing different design routes that use a successive optimization process is applied in different episodes under an ε-greedy policy. An “exploitation-exploration, evaluation and back-up” approach is used to interact with the environment and discover optimal policies. Design results with good imaging performance and the related design routes can be found automatically. The design experience can be further summarized using the obtained data directly or through other methods such as clustering-based machine learning. The experience offers valuable insight for completing other related design tasks. Human effort can be significantly reduced in both the design process and the tedious process of summarizing experience. This design framework can be integrated into optical design software and run nonstop in the background or on servers to complete design tasks and acquire experience automatically for various types of systems.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Using freeform optical surfaces in imaging systems is revolutionary in the field of optical design [1,2]. Freeform surfaces break the geometric constraints of rotational or translational symmetry and have more parameters in the surface description, which offer many more degrees of design freedom for imaging system design. Therefore, freeform surfaces can correct aberrations well in non-symmetric systems and lead to higher system specifications, better performance, and increased compactness. Freeform surfaces have been used successfully in a range of applications, such as head-mounted displays [3–5] and head-up displays [6,7], reflective imagers and telescopes [8–14], imaging spectrometers [15], and ultrashort throw ratio projection optics [16].

The general approach to an optical design task is first to determine a starting point, which originates from patents or other existing systems, and then to apply software optimization. The optical design process is both an art and a science. Design requires advanced techniques such as optimization algorithms and software programs, whereas the guidance toward a good and useful solution to a design problem comes from the designer. This guidance can be seen as design skill or design experience, and it may vary for different systems. Freeform surfaces offer more degrees of design freedom and great potential for achieving high-performance systems, but they also significantly increase the design complexity. Therefore, appropriate and novel optimization strategies as well as design experience, which differ from those for traditional systems, have to be used to aid the design of freeform imaging systems.

For a specific freeform imaging system design task with given system specifications (particularly advanced specifications such as a large field-of-view (FOV) and/or a large aperture), it is usually difficult to find an appropriate starting point whose system specifications match the design requirements well. Thus, the design often starts from an initial system with low system specifications and simple surface shapes (e.g., spheres), and subsequent optimization is then applied to obtain the final result. Generally, successive optimization strategies can be used, in which the design freedoms are added successively and the system specifications are gradually increased. However, different sequences of these design steps lead to different final results, whose imaging performance may vary significantly. For freeform imaging system design, there is currently a lack of design experience or expertise, as a result of the complex and uncertain nature of freeform optics and its limited history of use. If designers want to find good design results with no prior experience, a great deal of trial and error is needed, and an extensive amount of human effort and time is required. Although many novel design methods for freeform imaging systems have been proposed in recent years, they can hardly acquire the key design experience and/or fully automate the design process. In addition, there is a high probability that the design falls into a local optimum with low imaging performance. To summarize, the design of freeform imaging systems (particularly with advanced specifications) is tedious and time-consuming work. This in turn further hinders the accumulation of the design experience that is highly needed for the successful design of high-performance freeform imaging systems with advanced system specifications. A design framework that can handle the design task automatically while enabling the acquisition of design experience and insight for the current and other systems is essential.

In recent years, artificial intelligence and machine learning have been applied to many areas of science and engineering. An important technique of machine learning is deep learning (DL), which mainly uses multi-layered artificial neural networks for data analysis, feature extraction and decision making. There have been several studies on the combination of DL and optical design. Gannon and Liang used an artificial neural network to generalize relationships between performance parameters and lens shape for freeform illumination design tasks [17]. Yang et al. proposed a framework for starting point generation for freeform reflective systems using neural-network-based DL [18]. Côté et al. used DL techniques on lens design databases to produce high-quality starting points from various optical specifications, using both supervised and unsupervised training [19]. Except for the illumination design, the above design frameworks generally focus on the design of starting points, and no final design result or design experience can be obtained.

Another important branch of machine learning is reinforcement learning (RL). It is concerned with how software agents ought to take actions in a complex environment to maximize a notion of cumulative reward [20]. The computer uses trial and error to generate a solution to the problem. To make the machine do what the programmer wants, the artificial intelligence receives either rewards or penalties for the actions it performs. Using the RL process and related data analysis techniques, the software learning agent gathers experience and learns how to complete a task satisfactorily. RL has been studied in many disciplines, such as game theory, control theory, operations research, information theory, and simulation-based optimization. If we combine RL with imaging optical design, particularly the freeform imaging system design discussed in this paper, it is possible to complete a design task automatically from a starting point and significantly reduce human effort. Additionally, design experience may be acquired from the RL process, which is highly beneficial for other design tasks.

In this paper, we propose a design framework for freeform imaging systems using reinforcement learning. The RL framework mainly focuses on the design of freeform imaging optics using a successive optimization process. A trial-and-error method is conducted in different episodes under an ε-greedy policy. An “exploitation-exploration, evaluation and back-up” approach is used to interact with the environment and discover optimal policies. Using this design framework, design results with good imaging performance and the corresponding design routes can be found automatically. The design experience, which can be used to guide the design of other related systems, can be further acquired using the obtained data directly or through other methods such as clustering-based machine learning. Human effort in both the design process and the tedious process of acquiring experience can be significantly reduced. A design example is presented to validate the feasibility of the approach. This design framework can be integrated into optical design software and run nonstop in the background or on physical/cloud servers to complete design tasks and acquire experience automatically for various types of systems.

2. Method

2.1 Basic principles

Learning is generally achieved by interacting with the environment. For RL, the learning agent, or the “brain” must be able to sense the state of its environment to some extent and must be able to take actions that affect the state [21]. In this study, the environment can be considered as the optical design environment including the layout of the system, system specifications, surface parameters and imaging performance.

As demonstrated in the Introduction, freeform system design can start from an initial system with low system specifications and simple surface shapes. Then, successive optimization is used to obtain the final design result. A specific sequence of design steps can be defined as a design route. After each design step an intermediate system can be generated. Each system can be denoted as a state which can be characterized by, for example, its current system specifications, the design freedoms used for the optimization, the optical powers of the surfaces, and the locations of the freeform surfaces in three-dimensional space. These components can be integrated into a state vector s. The state components that denote the optical powers and the surface locations should be numbers that indicate the ranges in which the exact values fall, but not the exact values. Otherwise the number of different states will be very large and it will be difficult for the RL process and subsequent analysis to acquire recommended design routes.

The state of the initial system can be defined as the initial state sini. At each state, the designer may choose to execute a specific design operation (e.g., increasing the FOV in the x or y direction, expanding the system aperture, or adding more design freedoms) and then apply further optimization. Here, different operations are defined as different actions in RL. The successive optimization process stops when an ending state is achieved, for example, when the maximum degrees of design freedom have already been added and the system specifications meet the design requirements. A tree structure can be established to demonstrate the overall design process, as shown in Fig. 1. The big circles represent states (which can be considered nodes in the tree) and the small circles represent different actions. Starting from the initial state (root node), different design routes can be generated, which lead to different design results and imaging performances.

Fig. 1. Scheme of the tree structure showing the optical design process.

The basic goal of the RL method proposed in this paper is to find the design routes or design strategies that lead to good imaging performance for a system design task, particularly for a system with advanced system specifications; that is, the RL process is used to find design routes that maximize the reward value (R) of the final state in the route. The reward is an evaluation of imaging performance (e.g., wavefront error, spot size, and modulation transfer function (MTF)) and increases as the image quality improves. Reinforcement learning can be divided into two types: model-based RL and model-free RL [21]. Methods for solving RL problems that use models and planning are called model-based methods, as opposed to model-free methods, which are explicitly trial-and-error learners. The “model” of the environment is something that mimics the behavior of the environment: given a state and an action, the model might predict the resultant next state and next reward. Models are used for planning; if a model exists, the possible future outcomes of actions can be estimated before they are actually experienced [21]. As the designer may have no prior knowledge or actual experience of the design process, the RL method for this problem is model-free.

An important RL concept, the policy, should also be revisited. A policy gives the probability of selecting each action at a state, or can be considered a mapping from states to the actions to be taken in those states. The main difference between a policy and a model is that the policy determines the specific action at a state, whereas the model predicts the resultant state and reward when a specific action is executed at a specific state. Here, π(a|s) is defined as the probability that action a is performed at state s under the policy π. The action value of a state-action pair can then be defined and used to estimate how good it is to perform a given action at a given state. The notion of “how good” is defined in terms of the future final reward that can be expected. We use the notation qπ(s, a) for the action value, i.e., the expected return reward starting from s, taking the action a, and thereafter following policy π. During exploration, a table called the Q-table can be established, which records the action values of different actions at different states.
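As an informal illustration (not the authors' implementation), the Q-table described above can be stored as a lookup structure keyed by the discretized state vector. The number of actions and the example state tuple below are assumptions that mirror the design example in Section 3.

```python
from collections import defaultdict
import numpy as np

N_ACTIONS = 5  # assumed number of design operations available at each state

# q_table[state] holds the action values q_pi(s, a) for a = 0 .. N_ACTIONS-1;
# a state that has never been visited is initialized with all-zero action values.
q_table = defaultdict(lambda: np.zeros(N_ACTIONS))

# hypothetical discretized state vector (tuple, hence hashable)
state = (1, 1, 1, 4, 4, 3, 3, -2, -2, 5, 5, 0, -1, 1, 0, 1, 2, -1)
print(q_table[state])  # -> array of five zeros for this newly encountered state
```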

2.2 Reinforcement learning and data analysis method

The proposed RL method is similar to Monte Carlo RL and Monte Carlo tree search [21], but there are some differences. The model-free RL method in this paper is for episodic tasks. In all episodes the design starts from the same initial state. After completing each episode, the reward value R of the final state is calculated and used to update the action values of the state-action pairs through which the design route passed. An ε-greedy policy is used to choose the action at each state, so that a high reward is obtained with a high probability while a small probability is kept for exploring new states and design routes. As the number of completed episodes increases, the learning agent accumulates design experience, and a good design strategy can be acquired efficiently. The detailed procedure of the RL method is as follows:

  • (1) Establish the initial system, the design constraints and optimization environment for the system based on the requirements. Initiate an empty Q-table. Initiate the value of ε (a number between 0 and 1, but close to 0 for the ε-greedy policy). Define the ending states of the optimization process.
  • (2) Current state number i=1. Current design step number ω=1. Current episode number k=1.
  • (3) Start the entire design process from the initial state sini (root node in the tree).
  • (4) Assume that the design process reaches the ith state si (not an ending state) and the current episode number is k, with a total of Ni possible actions that can be executed at this state. Check whether si exists in the Q-table. If yes, get all the action values qπ(si, aj) (j=1…Ni). If no, add si into the Q-table and initiate all the action values qπ(si, aj) (j=1…Ni) to zero.
  • (5) Find the action a* that has the maximum action value among qπ(si, aj) (j=1…Ni). In the ε-greedy policy, we first allocate a probability of 1−ε to choosing a* (the greedy action) at si. Then, with probability ε, an action is selected randomly among all the Ni possible actions (including a*). In this way, all nongreedy actions are given a selection probability of ε/Ni, while the remaining high probability 1−ε+ε/Ni is given to the greedy action a*. The value of ε can be a changing value, which gradually decreases as the episode number k increases, to increase the level of greediness. Clearly, some actions may have the same maximum action value; in this case, these actions have the same probability of being chosen as the greedy action. Based on the above policy, each possible action may be selected as the actual action to be executed at si. Note that the possible actions do not include those that would lead to systems beyond the ending condition, for example, an FOV exceeding the required system specification or design freedoms exceeding the allowable range. When the actual action is determined, it is executed and the system is optimized. Then, the next state si+1 is achieved and i = i+1. If an ending state is not achieved, ω=ω+1. Repeat steps (4) and (5) until an ending state is achieved.
  • (6) If an ending state is achieved, the current episode is complete. Evaluate the reward value R of the current system. Revisit all the state-action pairs through which the design route passed in the kth episode. For a state-action pair (si, aj), if R ≥ qπ(si, aj), then update qπ(si, aj) to R. When all the action values have been updated, a single episode is completed. Then i = i+1, k = k+1 and ω=1. The reward values of the intermediate systems after each design step can also be evaluated and recorded for further use.
  • (7) Repeat the design steps (3)-(6) until the maximum episode number kmax is achieved. The flowchart of the method is shown in Fig. 2, and a minimal code sketch of a single episode is given after the figure.

Fig. 2. Flowchart of the reinforcement learning method.
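The following is a minimal Python sketch of steps (3)-(6), i.e., one episode of the exploitation-exploration, evaluation and back-up cycle. It is an illustration rather than the authors' code: the callbacks allowed_actions, apply_action, is_ending and reward_of are hypothetical hooks into the optical design environment (action execution, optimization in the design software, and imaging performance evaluation), and q_table is assumed to map a discretized state tuple to an array of action values, as sketched in Section 2.1.

```python
import random

def epsilon_greedy(q_values, allowed, eps):
    """Select an action index from 'allowed' under the epsilon-greedy policy:
    with probability eps pick any allowed action (exploration, greedy included),
    otherwise pick the greedy action, breaking ties uniformly at random."""
    if random.random() < eps:
        return random.choice(allowed)
    best = max(allowed, key=lambda a: q_values[a])
    ties = [a for a in allowed if q_values[a] == q_values[best]]
    return random.choice(ties)

def run_episode(initial_state, q_table, eps,
                allowed_actions, apply_action, is_ending, reward_of):
    """One episode: exploit/explore until an ending state, evaluate R, back up R."""
    state, visited = initial_state, []
    while not is_ending(state):
        allowed = allowed_actions(state)              # actions beyond the ending condition excluded
        a = epsilon_greedy(q_table[tuple(state)], allowed, eps)
        visited.append((tuple(state), a))
        state = apply_action(state, a)                # execute the design operation and optimize
    R = reward_of(state)                              # evaluation of the final system
    for s, a in visited:                              # back-up: R replaces smaller action values
        q_table[s][a] = max(q_table[s][a], R)
    return R
```

An outer loop over episodes k = 1, …, kmax, with ε decreased as k grows, completes the procedure.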

A tree structure can be used to illustrate the RL process in a clearer manner. The entire RL process can be divided into three key steps, as shown in Fig. 3.

  • (a) Exploitation-Exploration: Starting from the root node (initial state), a design route is completed in an episode using an exploitation-exploration process. At one node, an action that has been executed previously may be selected (exploitation), while a new action may also be selected and a new node may be found (exploration). An ε-greedy policy is used to choose the action at each node.
  • (b) Evaluation: If an ending node is achieved, evaluate the reward value R of this node (state).
  • (c) Backup: Back up the reward value R to update the action values of the state-action pairs through which the design route passes. Then the design returns to the exploitation-exploration step.

Fig. 3. Tree structure showing the “exploitation-exploration, evaluation and back-up” steps.

The above RL process can be performed automatically. Designers or engineers only need to establish the initial system, determine the design constraints and compile the optimization code. The best design route and the corresponding system design result among the episodes can be output by the RL process. The design framework does not remove the physics from the design; rather, it interacts with the actual optical design environment throughout the entire design process. An existing Q-table can be taken as the initial Q-table to continue the RL of the same design problem rather than starting from an empty table. Thus, the RL process can stop or proceed at any time based on actual need. Additionally, in traditional optical design, design experience is generally obtained through extensive design work by designers. Using the proposed framework, it is possible to summarize optical design experience (e.g., how to choose specific design operations at different types of states and how to arrange the different design operations in the design route to obtain a better design result) for other similar design tasks through an automated RL process and related data analysis (see the Example demonstration section). Thus, human effort can be significantly reduced.

The result of the RL process for an existing system can be used to directly accelerate the RL process of another new design task whose system specifications are smaller than and within those of the existing design. This is because a great deal of design work has already been completed when designing the system with higher system specifications. The RL process is similar to the steps given above. The difference is that, in each episode, we do not need to choose a specific action and optimize the system step by step. We can simply go through the recorded design route of each episode from the initial state to an ending state of the new design. The reward value (which may have already been obtained) of that ending state can be used to update the new Q-table. If no ending state is found in a design route, the Q-table is not updated after this episode. Using the above steps, a Q-table of the new design can be generated very quickly, and it can be used as the initial Q-table for a further RL process for the new system.
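A possible sketch of this replay-based acceleration is shown below. It assumes, for illustration, that each recorded route is stored as a sequence of (state, action, next state) triples and that the reward of every recorded state is available in a lookup table; these data structures are not specified in the paper.

```python
def bootstrap_q_table(recorded_routes, state_reward, is_new_ending, new_q):
    """Seed the Q-table of a new, smaller-specification design by replaying recorded
    design routes of the original design; no optimization is rerun."""
    for route in recorded_routes:
        visited = []
        for state, action, next_state in route:
            visited.append((tuple(state), action))
            if is_new_ending(next_state):                  # ending state of the *new* design task
                R = state_reward[tuple(next_state)]        # reward was already evaluated earlier
                for s, a in visited:
                    new_q[s][a] = max(new_q[s][a], R)
                break
        # if a route never reaches an ending state of the new design, it is skipped
    return new_q
```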

3. Example demonstration

We use the design of a Wetherell/Womble configuration [22] freeform off-axis reflective triplet with a large FOV to validate the proposed design framework. The system works in the visible spectral band. The focal length of this system is 120 mm and the entrance pupil diameter (EPD) is 34.3 mm. The FOV of this system is 20°×20°. M2 is the aperture stop. All three mirrors (M1, M2, M3) are freeform surfaces, and the surface type is an XY polynomial surface up to the sixth order with a base conic. The freeform surface can be written as

$$\begin{aligned} z(x,y) ={} & \frac{c({x^2} + {y^2})}{1 + \sqrt {1 - (1 + k){c^2}({x^2} + {y^2})}} + {A_3}{x^2} + {A_5}{y^2} + {A_7}{x^2}y + {A_9}{y^3} + {A_{10}}{x^4} + {A_{12}}{x^2}{y^2} + {A_{14}}{y^4}\\ & + {A_{16}}{x^4}y + {A_{18}}{x^2}{y^3} + {A_{20}}{y^5} + {A_{21}}{x^6} + {A_{23}}{x^4}{y^2} + {A_{25}}{x^2}{y^4} + {A_{27}}{y^6}, \end{aligned}$$
where c is the base curvature of the surface at the vertex, k is the conic constant, and Ai is the coefficient of the x-y terms. Because the optical system is symmetric about the YOZ plane, only the terms with even powers of x in the XY polynomials are used.
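For reference, the surface sag defined by the expression above can be evaluated directly; the short sketch below is only an illustration of the surface description, with the coefficient dictionary A assumed to hold the nonzero Ai values.

```python
import numpy as np

def freeform_sag(x, y, c, k, A):
    """Sag z(x, y) of the XY-polynomial freeform surface defined above.
    A is a dict of the coefficients A3, A5, ..., A27 (keys 3, 5, ..., 27);
    only even powers of x appear because of the symmetry about the YOZ plane."""
    r2 = x**2 + y**2
    base = c * r2 / (1.0 + np.sqrt(1.0 - (1.0 + k) * c**2 * r2))   # base conic term
    poly = (A[3]*x**2 + A[5]*y**2 + A[7]*x**2*y + A[9]*y**3
            + A[10]*x**4 + A[12]*x**2*y**2 + A[14]*y**4
            + A[16]*x**4*y + A[18]*x**2*y**3 + A[20]*y**5
            + A[21]*x**6 + A[23]*x**4*y**2 + A[25]*x**2*y**4 + A[27]*y**6)
    return base + poly
```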

An initial system matching the first-order requirements is first established for the subsequent RL process using decentered and tilted spheres, as shown in Fig. 4(a). The first-order focal length and entrance pupil diameter of the initial system are the same as the final design requirements. The FOV of the initial system is 4°×4°. The imaging performance of this system is very low. A successive optimization method can be used to obtain the final design result. During the design process, the surface coefficients are gradually added as variables, and the FOVs in the x and y directions are expanded successively. The system optimization is performed in the optical design software CODE V. The design constraints are established before the optimization, including constraints to maintain the focal length, control distortion and eliminate light obscuration. The focal lengths of the system in the x and y directions are calculated using the ABCD matrix method and then controlled. The relative distortion of the system is controlled to be within 5% in both the x and y directions using real ray tracing data. The marked distances L1 to L5 shown in Fig. 4(a) have to be controlled to eliminate light obscuration or avoid surface interference. The distances between M1 and M2, and between M2 and M3, are controlled so that they do not become too large. The chief ray of the central field is controlled to intersect each freeform surface at its center (the vertex of the freeform surface). The conic constants of the freeform surfaces are controlled to be within ±10. The error function type used in the optimization is the default transverse ray aberration type in CODE V. Step optimization mode is selected. The derivative increments are computed using finite differences. Generally, there are five possible actions at a specific state: increase the surface order of M1 (Ψ1) by 1 and optimize the system; increase the surface order of M2 (Ψ2) by 1 and optimize the system; increase the surface order of M3 (Ψ3) by 1 and optimize the system; increase the FOV in the x direction (XFOV) by 4° and optimize the system; and increase the FOV in the y direction (YFOV) by 4° and optimize the system. The actions are summarized in Table 1. The definition of the surface order of one surface in this paper is summarized in Table 2. Actions that would lead to a surface order greater than 6 or an FOV greater than 20° in one direction are forbidden.

Fig. 4. Initial systems. (a) Initial system for the original design. (b) Initial system for the new designs (EPD=34.3 mm).

Table 1. Actions used in the design example.

Table 2. Definition of the surface order.

A vector s is used to characterize the system after each design step. Ψ1, Ψ2, Ψ3, XFOV and YFOV are taken as components of the vector s. However, these are not sufficient, because they offer no information about the surface shapes and locations. Therefore, some other components are used:

(1) The first-order focal lengths in the x and y directions of mirror i (EFLXi and EFLYi, i=1,2,3, unit: mm). These data are used to represent the surface powers (shapes) and can be calculated using the ABCD matrix. The ABCD matrix is a ray transfer matrix that represents derivatives of the base ray configuration at an ending surface with respect to changes in the position and direction of the base ray at a starting surface. The base ray is generally the chief ray of the central field. All first-order properties of an optical system, including focal length, magnification, and the locations of the Gaussian pupils, can be determined using ABCD matrices [23,24]. The first-order focal lengths of a single surface (setting it to be both the start surface and the end surface) can also be approximately calculated. Detailed descriptions of the ABCD matrix and the related computation method can be found in Refs. [23] and [24]. Besides the ABCD matrix, other methods can be used to calculate the focal length of a surface, e.g., finding the best-fitting sphere and obtaining the focal length from it. Using only the focal length is of course not sufficient to describe the exact shape of a freeform surface. However, as we are only concerned with the rough state of the system during the reinforcement learning process, an approximate estimate of the surface shape through its power, characterized by the focal length of the surface, is sufficient. If all the surface coefficients used in the surface representation were employed to describe the shape of the freeform surface, the number of state components would be very large; in addition, surfaces with similar shapes may have totally different values for some surface coefficients. These issues would make the RL process very difficult to conduct.

(2) The global y-decenter, z-decenter and α-tilt of mirror i (YDEi, ZDEi and ADEi, i=1,2,3, unit: mm or degree) with respect to a predefined global coordinate system. These data are used to represent the surface locations. In particular, YDE2 and ZDE2 are set to zero directly and are frozen during the design.

As discussed above, the state components that denote the optical powers and the surface locations should be numbers that indicate the ranges in which the exact values fall, rather than the exact values. Here, further data processing is applied to these values:

$$EFLX_{i,\Gamma} = \begin{cases} \textrm{round}\!\left(\dfrac{\textrm{sigmoid}^{*}(1000)}{\Delta_{EFL}}\right), & \textrm{if } EFLX_i \ge 1000\ \textrm{mm} \textrm{ or } EFLX_i \le -1000\ \textrm{mm}\\[2ex] \textrm{round}\!\left(\dfrac{\textrm{sigmoid}^{*}(EFLX_i)}{\Delta_{EFL}}\right), & \textrm{if } -1000\ \textrm{mm} < EFLX_i < 1000\ \textrm{mm} \end{cases}, \qquad i = 1,2,3,$$
$$EFLY_{i,\Gamma} = \begin{cases} \textrm{round}\!\left(\dfrac{\textrm{sigmoid}^{*}(1000)}{\Delta_{EFL}}\right), & \textrm{if } EFLY_i \ge 1000\ \textrm{mm} \textrm{ or } EFLY_i \le -1000\ \textrm{mm}\\[2ex] \textrm{round}\!\left(\dfrac{\textrm{sigmoid}^{*}(EFLY_i)}{\Delta_{EFL}}\right), & \textrm{if } -1000\ \textrm{mm} < EFLY_i < 1000\ \textrm{mm} \end{cases}, \qquad i = 1,2,3,$$
$$YDE_{i,\Gamma} = \textrm{round}\!\left(\frac{YDE_i}{\Delta_{YDE}}\right), \quad ZDE_{i,\Gamma} = \textrm{round}\!\left(\frac{ZDE_i}{\Delta_{ZDE}}\right), \quad ADE_{i,\Gamma} = \textrm{round}\!\left(\frac{ADE_i}{\Delta_{ADE}}\right), \qquad i = 1,2,3,$$
where ΔEFL = ΔYDE = ΔZDE = 60 mm and ΔADE = 15°. We use the subscript Γ to denote the range number. The function round() represents rounding down the input value to an integer. The function sigmoid*(x) is a user-defined function obtained by stretching and translating the original sigmoid function:
$$\textrm{sigmoid}^{*}(x) = 1000 \cdot \left[\frac{1}{1 + \textrm{e}^{-x/250}} - 0.5\right]. $$
Thus, the surface focal lengths and locations are transformed into integer values that represent the data range, and these values are added into the vector s. To summarize, the state vector s for this design problem can be written as:
$$\begin{aligned} {\boldsymbol s} = [&{\Psi _1},{\Psi _2},{\Psi _3},XFOV,YFOV,EFLX_{1,\Gamma},EFLY_{1,\Gamma},EFLX_{2,\Gamma},EFLY_{2,\Gamma},\\ &EFLX_{3,\Gamma},EFLY_{3,\Gamma},YDE_{1,\Gamma},ZDE_{1,\Gamma},ADE_{1,\Gamma},ADE_{2,\Gamma},YDE_{3,\Gamma},ZDE_{3,\Gamma},ADE_{3,\Gamma}]. \end{aligned}$$
An ending state refers to a state whose [Ψ1, Ψ2, Ψ3, XFOV, YFOV] equals [6, 6, 6, 20°, 20°].
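The discretization of the focal-length and location components described by the equations above could be implemented as in the following sketch; the function and variable names are chosen here for illustration only.

```python
import numpy as np

DELTA_EFL, DELTA_YDE, DELTA_ZDE, DELTA_ADE = 60.0, 60.0, 60.0, 15.0  # range widths from the text

def sigmoid_star(x):
    """Stretched and translated sigmoid, mapping any focal length into (-500, 500)."""
    return 1000.0 * (1.0 / (1.0 + np.exp(-x / 250.0)) - 0.5)

def efl_range(efl):
    """Range number of a surface focal length: |EFL| >= 1000 mm is treated as 1000 mm,
    then the sigmoid* value is divided by the range width and rounded down."""
    value = 1000.0 if abs(efl) >= 1000.0 else efl
    return int(np.floor(sigmoid_star(value) / DELTA_EFL))

def location_range(value, delta):
    """Range number of a decenter (mm) or tilt (degree) value."""
    return int(np.floor(value / delta))

# Example: range numbers of a few hypothetical values for one mirror
efl_x1, yde1, ade1 = 350.0, -95.0, 20.0
components = [efl_range(efl_x1), location_range(yde1, DELTA_YDE), location_range(ade1, DELTA_ADE)]
```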

The RL process demonstrated in the previous section is used for this design task. In each episode, the design starts from the initial system (initial state). At each state during the optimization process, one action is selected and executed. When an ending state is achieved, the RMS wavefront errors of all the sample fields are calculated, and the reward value R is taken as the reciprocal of the average RMS wavefront error. Then the action values in the Q-table are updated and a single episode is completed. If a ray tracing error or other fatal error leading to design failure occurs during the optimization, the design in this episode terminates immediately and R is assigned a value of 0. The value of ε is 0.2 in the first episode and decreases at a uniform rate to 0.1 in the final episode. A total of 2000 episodes are employed in the RL process. The entire design process was conducted automatically on a personal computer with an Intel Core i7-7700 CPU @ 3.60 GHz and 32 GB of memory. The total elapsed time was approximately 62 hours. The obtained Q-table can be taken as the initial Q-table for continued RL of this design task.
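For clarity, the reward evaluation and the linearly decreasing ε used in this example can be written as the following small sketch, assuming the RMS wavefront errors of the sample fields are available as an array.

```python
import numpy as np

def reward(rms_wavefront_errors):
    """Reward R: reciprocal of the average RMS wavefront error over the sample fields."""
    return 1.0 / np.mean(rms_wavefront_errors)

def epsilon(k, k_max=2000, eps_first=0.2, eps_last=0.1):
    """Epsilon decreases at a uniform rate from 0.2 in the first episode to 0.1 in the last."""
    return eps_first + (eps_last - eps_first) * (k - 1) / (k_max - 1)
```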

The average reward value of the previous 30 episodes, starting from episode number k=30, is plotted in Fig. 5, and shows an increasing trend as training progresses. Different design routes and corresponding design results were explored during the RL process. Excluding the episodes that resulted in a ray tracing error or other fatal errors leading to design failure, the worst design result had a final reward value of 2.87. The system layout and its MTF plot are given in Figs. 6(a) and (b). The best design result had a final reward value of 76.95. The system layout and its MTF plot are given in Figs. 6(c) and (d). Some other design results are shown in Figs. 6(e)-(g); their reward values were 16.88, 30.43 and 40.8, respectively.

Fig. 5. Average reward value of the previous 30 episodes starting from episode number k=30.

Fig. 6. Typical design results. (a) and (b) Layout and MTF of the worst design. (c) and (d) Layout and MTF of the best design. (e)-(g) Some other design results.

In addition to the good design results, the design experience of this type of system can be acquired through the RL process and the obtained data. The key experience is the judgement of the final imaging performance of different design routes. A design route can be described as the order of the appearance of actions #1 to #5 in the route starting from the initial system. In addition to observing the different design routes and the final reward values directly, and then summarizing the experience, other methods can be used to simplify this process. There are 26 design steps (from step 1 to step 26) in each design route. Each action from #1 to #5 appears in a design route multiple times. For each action #μ (μ=1,2,3,4,5) in a design route, we can find the exact step numbers where action #μ appears and then calculate the mean value #μmean and standard deviation #μstddev. Thus, we can define a feature vector Fk for the design route corresponding to episode k.

$${{\boldsymbol F}_k} = [\#1_{\textrm{mean}}, \#2_{\textrm{mean}}, \#3_{\textrm{mean}}, \#4_{\textrm{mean}}, \#5_{\textrm{mean}}, \#1_{\textrm{stddev}}, \#2_{\textrm{stddev}}, \#3_{\textrm{stddev}}, \#4_{\textrm{stddev}}, \#5_{\textrm{stddev}}]. $$
This vector can be used to characterize the design route in each episode. For the same component of the feature vector across different episodes, the values of this component, together with the corresponding final reward values, can be shown in a scatter plot. Figure 7 shows the scatter plots for all 10 components. Some experience can be concluded from these figures directly. The actions are not required to appear in the design process in a specific order (i.e., it is not necessary to complete all the actions of one type and then another); rather, they should be executed in an alternating manner. For actions #1, #2, and #3 (increasing the surface order), it is recommended that these actions be dispersed throughout the entire design process. Concentrating the overall distribution of action #1 in the early or late stage of the design process is not recommended. Concentrating the overall distribution of action #2 in the early stage is not recommended. It is better for the overall appearance of action #3 to be slightly earlier than that of #1 and #2; implementing it in the late stage is not recommended. For actions #4 and #5, it is better for the overall appearance of action #4 (increasing the XFOV) to be in the late stage of the design process, whereas it is better for the overall appearance of action #5 (increasing the YFOV) to be in the early stage of the design process; implementing it in the late stage is not recommended.
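A sketch of how the feature vector could be computed from one recorded design route is given below; representing a route as the list of the 26 action numbers taken in that episode is an assumption made here for illustration.

```python
import numpy as np

def feature_vector(route_actions):
    """Feature vector F_k of a design route.
    route_actions is the list of action numbers (1..5) taken at design steps 1..26;
    for every action, the step numbers at which it appears are collected and their
    mean and standard deviation are computed."""
    step_numbers = {mu: [step for step, a in enumerate(route_actions, start=1) if a == mu]
                    for mu in range(1, 6)}
    means = [np.mean(step_numbers[mu]) for mu in range(1, 6)]
    stddevs = [np.std(step_numbers[mu]) for mu in range(1, 6)]
    return np.array(means + stddevs)
```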

Fig. 7. Scatter plots of the values of all 10 components in the feature vector and the corresponding final reward values.

Design experience can also be summarized in other ways. One example is clustering. Clustering is an unsupervised machine learning technique that involves the grouping of data points; data points in the same group should have similar properties and/or features. A clustering algorithm can be used to classify each design route into a specific group. Here, K-means clustering is used as an example [25]. The “data point” used for clustering is the feature vector F of each design route (which can be seen as a data point in 10-dimensional space). Among all the design routes explored in the RL process, the routes that led to ray tracing failure or other design failures are not included in the clustering. Additionally, each distinct feature vector appears only once in the clustering dataset. Thus, a total of 1247 different feature vectors are input into the K-means clustering algorithm. The number of different groups is 300. After clustering, similar feature vectors (similar design routes) are grouped together. By observing and comparing the reward values of the design routes in different groups, related design experience can be obtained (e.g., which types of design routes lead to good performance).
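As an illustration, the clustering step could be performed with a standard K-means implementation such as the one in scikit-learn; the library choice and function names below are assumptions, as the paper only states that K-means clustering is used.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_routes(feature_vectors, rewards, n_groups=300, seed=0):
    """Group the 10-dimensional feature vectors of the design routes into n_groups
    clusters and return the cluster labels and the average final reward of each group."""
    X = np.asarray(feature_vectors)                          # shape: (n_routes, 10)
    labels = KMeans(n_clusters=n_groups, random_state=seed).fit_predict(X)
    group_reward = {g: float(np.mean([r for r, lab in zip(rewards, labels) if lab == g]))
                    for g in set(labels)}
    return labels, group_reward
```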

The optical design experience acquired by the above process can be used for other similar design tasks. The design routes that lead to good/bad final design results in the original design will also lead to good/bad performance for similar systems, to some extent. Other designs of off-axis three-mirror systems are used as examples. Three different system specifications are used: the same specifications as the original design, system specifications with a larger EPD (EPD=40 mm, FOV=20°×20°, focal length=120 mm), and system specifications with a larger FOV (EPD=30 mm, FOV=24°×24°, focal length=120 mm). These designs also use the successive optimization process and start from a new initial system. The FOV of the initial system is 4°×4° and the first-order focal length is approximately 120 mm. For the design of the system with the same specifications as the original design, the initial system with EPD=34.3 mm is as shown in Fig. 4(b). The other two designs also use this initial system but with EPD=40 mm and 30 mm, respectively.

From the original design problem, we chose 8 different groups (Group Nos. 167, 227, 81, 186, 29, 147, 149 and 88) out of 300 groups generated using clustering. The average reward value of each group varied from low to high and is plotted in Fig. 8. The same design routes in each group are applied to the three cases to obtain the final design results (for the design with larger FOV, actions #4 and #5 increased the FOV in one direction by 5° rather than 4°). The average reward values of different design routes in these groups were calculated, and are also plotted in Fig. 8. It can be seen that the overall changing trends of the average reward values for the three designs were similar to that of the original design. This validates that it is feasible for design experience to be used to guide the design of other systems with similar design requirements. To summarize, using the RL process and some subsequent data processing techniques, good design results can be explored and related design experience can be obtained automatically. Human effort can be significantly reduced. The experience can be used in the design of other systems.

Fig. 8. Average reward value of the design routes in several groups for the original and new designs.

The result of the RL process of the original system can be used to directly accelerate the RL process of another new design task whose system specifications are smaller and within those of the original design (e.g., a system has the same EPD and focal length as the original design but an FOV of 16°×12°). A Q-table for the new design can be generated very quickly using the recorded data of the original design and it can be used as the initial Q-table for a further RL process for the new system.

All code for RL and clustering in the design example is written in Python. CODE V is called from the Python code through the CODE V API to perform the optimization and imaging performance evaluation. The CODE V API is an application programming interface designed to allow programmatic interaction between CODE V and other programs [23]. It provides a set of interface functions that a client can use, via the Microsoft Windows standard Component Object Model (COM), to interact with an instance of CODE V.
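For completeness, a rough sketch of how such a COM-based connection might look from Python (using the pywin32 package) is given below. The ProgID and method names are assumptions for illustration only; the actual interface names should be taken from the CODE V API documentation [23].

```python
# Rough sketch only: the ProgID "CodeV.Command" and the StartCodeV/Command/StopCodeV
# calls are assumptions; consult the CODE V API documentation for the real interface.
import win32com.client

cv = win32com.client.Dispatch("CodeV.Command")   # connect to the CODE V COM server (assumed ProgID)
cv.StartCodeV()                                  # start a CODE V session (assumed method)
output = cv.Command("in my_design.seq")          # run a CODE V command or macro (assumed method)
print(output)
cv.StopCodeV()                                   # close the session (assumed method)
```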

4. Conclusions and discussions

We proposed a design framework for freeform imaging systems, particularly systems with advanced system specifications, using model-free reinforcement learning. An “exploitation-exploration, evaluation and back-up” approach is used to interact with the environment and to complete different design routes under an ε-greedy policy. Design results with good imaging performance and the corresponding design routes can be found automatically. Using the obtained data as well as specific analysis methods such as clustering-based machine learning, design experience can be further acquired, offering insight for the current and other related system design tasks. Thus, human effort can be significantly reduced in both the design process and the tedious process of summarizing experience. The proposed design framework can be integrated into optical design software and run non-stop in the background or on physical/cloud servers to complete design tasks and acquire experience automatically for various types of systems. Additionally, this design framework is not restricted to freeform imaging systems; it can be applied to a wide range of systems, including traditional rotationally symmetric systems.

The efficiency and effectiveness of the proposed design framework can be further improved using other techniques. For example: (1) The RL process can be accelerated using parallel computing and/or other advanced programming techniques. (2) The existing knowledge or experience of designers can be integrated into the RL process by adjusting the initial Q-table or the learning policies accordingly. Knowledge of nodal aberration theory can also be used to guide the action selection process. Additionally, the design framework can be integrated into existing optical design software: during the normal optimization process handled by designers, the computer or software can record the design operations, design routes, and corresponding design results. These records and data can help designers or computers obtain higher efficiency and better design results in the RL process, and benefit the subsequent experience acquisition. The above techniques can be seen, to some extent, as imitation learning. (3) In this paper, CODE V was used to optimize the system in each design step of the RL process. If faster and more powerful optimization algorithms are used, better design results can be achieved more quickly. (4) Temporal-difference learning methods (e.g., Q-learning and Sarsa), which combine Monte Carlo ideas and dynamic programming, can be used to increase the efficiency of the RL process. (5) If proper standards and formats are established, the design experience or the Q-tables can be uploaded and shared in the online optical design community or built into optical design software. This would significantly promote the development of automated and intelligent optical system design. (6) DL and neural networks can be introduced into the RL framework as a function approximator [26]. Thus, the complex and large Q-table can be replaced and the RL process can be significantly simplified.

The current RL-based design framework described in the Method and Example sections has some limitations (such as a fixed number of design steps). In addition, for some complex systems, many other design skills or steps should be used during the design process to obtain good image quality, such as modifying the weights of each field and gradually adding or modifying constraints. These design steps can also be treated as actions, and they have the potential to be integrated into the proposed design framework. Furthermore, it is difficult to improve optical design frameworks to the point where people are completely freed from participation, but human involvement can be reduced to a very low level.

Imaging systems consisting of phase elements have many advantages, and related design methods have been proposed [27–29]. For example, in Ref. [27], a simple and flexible design method for holographically printed freeform optics is proposed, and it has been successfully used in the design and development of an augmented reality (AR) near-eye display system. For this kind of system design task, the method in [27] may be better than the RL-based method proposed in this paper, as the design requires careful determination of the mirror sizes and the distances between mirrors, while the phase profiles of multiple mirrors integrated on a single element have to be calculated. However, the RL-based method proposed in this paper can also be used in the design of complex imaging systems containing phase elements, especially systems with advanced system specifications. The traditional design and optimization of the phase elements, including their positions and phase profiles (analogous to the geometric surface shapes), also requires extensive human effort and experience. Using the proposed framework, it is possible to generate good design results automatically, and human effort can be significantly reduced in both the design and the experience-summarizing processes.

Funding

National Key Research and Development Program of China (2017YFA0701200); National Natural Science Foundation of China (61805012); Young Elite Scientist Sponsorship Program by CAST (2019QNRC001).

Acknowledgments

We thank Synopsys for the educational license of CODE V.

Disclosures

The authors declare no conflicts of interest.

References

1. K. P. Thompson and J. P. Rolland, “Freeform Optical Surfaces: A Revolution in Imaging Optical Design,” Opt. Photonics News 23(6), 30–35 (2012). [CrossRef]  

2. S. Wills, “Freeform Optics: Notes from the Revolution,” Opt. Photonics News 28(7), 34–41 (2017). [CrossRef]  

3. D. Cheng, Y. Wang, H. Hua, and M. M. Talha, “Design of an optical see-through head-mounted display with a low f-number and large field of view using a freeform prism,” Appl. Opt. 48(14), 2655–2668 (2009). [CrossRef]  

4. A. Wilson and H. Hua, “Design and demonstration of a vari-focal optical see-through head-mounted display using freeform Alvarez lenses,” Opt. Express 27(11), 15627–15637 (2019). [CrossRef]  

5. P. Benitez, J. C. Miñano, P. Zamora, D. Grabovičkić, M. Buljan, B. Narasimhan, J. Gorospe, J. López, M. Nikolić, E. Sánchez, C. Lastres, and R. Mohedano, “Advanced freeform optics enabling ultra-compact VR headsets,” Proc. SPIE 10335, 103350I (2017).

6. L. Gu, D. Cheng, Y. Liu, J. Ni, T. Yang, and Y. Wang, “Design and fabrication of an off-axis four-mirror system for head-up displays,” Appl. Opt. 59(16), 4893–4900 (2020). [CrossRef]  

7. Z. Qin, S. Lin, K. Luo, C. Chen, and Y. Huang, “Dual-focal-plane augmented reality head-up display using a single picture generation unit and a single freeform mirror,” Appl. Opt. 58(20), 5366–5374 (2019). [CrossRef]  

8. J. Zhu, W. Hou, X. Zhang, and G. Jin, “Design of a low F-number freeform off-axis three-mirror system with rectangular field-of-view,” J. Opt. 17(1), 015605 (2015). [CrossRef]  

9. A. Bauer, E. M. Schiesser, and J. P. Rolland, “Starting geometry creation and design method for freeform optics,” Nat. Commun. 9(1), 1756 (2018). [CrossRef]  

10. T. Yang, G. Jin, and J. Zhu, “Automated design of freeform imaging systems,” Light: Sci. Appl. 6(10), e17081 (2017). [CrossRef]  

11. D. Reshidko and J. Sasian, “A method for the design of unsymmetrical optical systems using freeform surfaces,” in Optical Design and Fabrication 2017 (Freeform, IODC, OFT), OSA Technical Digest (online) (Optical Society of America, 2017), paper JW1B.2.

12. M. Beier, J. Hartung, T. Peschel, C. Damm, A. Gebhardt, S. Scheiding, D. Stumpf, U. D. Zeitner, S. Risse, R. Eberhardt, and A. Tünnermann, “Development, fabrication, and testing of an anamorphic imaging snap-together freeform telescope,” Appl. Opt. 54(12), 3530–3542 (2015). [CrossRef]  

13. W. Jahn, M. Ferrari, and E. Hugot, “Innovative focal plane design for large space telescope using freeform mirrors,” Optica 4(10), 1188–1195 (2017). [CrossRef]  

14. K. Fuerschbach, J. P. Rolland, and K. P. Thompson, “A new family of optical systems employing φ-polynomial surfaces,” Opt. Express 19(22), 21919–21928 (2011). [CrossRef]  

15. J. Reimers, A. Bauer, K. P. Thompson, and J. P. Rolland, “Freeform spectrometer enabling increased compactness,” Light: Sci. Appl. 6(7), e17026 (2017). [CrossRef]  

16. Y. Nie, R. Mohedano, P. Benítez, J. Chaves, J. C. Miñano, H. Thienpont, and F. Duerr, “Multifield direct design method for ultrashort throw ratio projection optics with two tailored mirrors,” Appl. Opt. 55(14), 3794–3800 (2016). [CrossRef]  

17. C. Gannon and R. Liang, “Using machine learning to create high-efficiency freeform illumination design tools,” arXiv preprint arXiv:1903.11166v1 (2018).

18. T. Yang, D. Cheng, and Y. Wang, “Direct generation of starting points for freeform off-axis three-mirror imaging system design using neural network based deep-learning,” Opt. Express 27(12), 17228–17238 (2019). [CrossRef]  

19. G. Côté, J. Lalonde, and S. Thibault, “Extrapolating from lens design databases using deep learning,” Opt. Express 27(20), 28279–28292 (2019). [CrossRef]  

20. https://en.wikipedia.org/wiki/Reinforcement_learning.

21. R. S. Sutton and A. G. Barto, Reinforcement Learning: An introduction (MIT Press, Cambridge, MA, 2011).

22. W. B. Wetherell and D. A. Womble, “All-reflective three element objective,” U.S. Patent, 4,240,707 (1980).

23. CODE V Documentation Library, Synopsys Inc. (2018).

24. A. Gerrard and J.M. Burch, Introduction to Matrix Methods in Optics (Dover Publications, New York, 1975).

25. https://en.wikipedia.org/wiki/K-means_clustering

26. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature 518(7540), 529–533 (2015). [CrossRef]  

27. J. Jeong, C. K. Lee, B. Lee, S. Lee, S. Moon, G. Sung, H. S. Lee, and B. Lee, “Holographically printed freeform mirror array for augmented reality near-eye display,” IEEE Photonics Technol. Lett. 32(16), 991–994 (2020). [CrossRef]  

28. J. Mendes-Lopes, P. Benítez, J. C. Miñano, and A. Santamaría, “Simultaneous multiple surface design method for diffractive surfaces,” Opt. Express 24(5), 5584–5590 (2016). [CrossRef]  

29. Y. Duan, T. Yang, D. Cheng, and Y. Wang, “Design method for nonsymmetric imaging optics consisting of freeform-surface-substrate phase elements,” Opt. Express 28(2), 1603–1620 (2020). [CrossRef]  
