
Intelligent design of the chiral metasurfaces for flexible targets: combining a deep neural network with a proximal policy optimization algorithm

Open Access

Abstract

Recently, deep reinforcement learning (DRL) for metasurface design has received increased attention for its excellent decision-making ability in complex problems. However, time-consuming numerical simulation has hindered the adoption of DRL-based design methods. Here we apply the Deep learning-based virtual Environment Proximal Policy Optimization (DE-PPO) method to design 3D chiral plasmonic metasurfaces for flexible targets, and we model the metasurface design process as a Markov decision process to facilitate the training. A well-trained DRL agent designs chiral metasurfaces that exhibit the optimal absolute circular dichroism value (typically, ∼ 0.4) at various target wavelengths such as 930 nm, 1000 nm, 1035 nm, and 1100 nm with great time efficiency. Besides, the training process of the PPO agent is exceptionally fast with the help of the deep neural network (DNN) auxiliary virtual environment. Also, this method changes all variable parameters of the nanostructure simultaneously, reducing the size of the action vector and thus the output size of the DNN. Our proposed approach could find applications in the efficient and intelligent design of nanophotonic devices.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Chirality plays an essential role in human life, especially at the molecular level [1], and many biological molecules, such as DNA and proteins, are chiral. The field of optics offers valuable tools to probe the chirality of molecules, including circular dichroism (CD) and optical rotatory dispersion (ORD). The CD spectrum is especially useful for analyzing the structure and conformation of complex biomolecules [2]. Interestingly, chiral plasmonic metasurfaces can be used to significantly enhance the signal strength of natural chiral objects [3–5], paving the way for highly sensitive bio-sensing. It is therefore worthwhile to study and boost the chiroptical responses of the metasurfaces themselves. The functionality of chiral plasmonic metasurfaces depends on their structural parameters, which makes it crucial to find the optimal parameters efficiently. Apart from design methods based on human intuition, recent studies have focused on efficient and intelligent ways to optimize nanostructures. A typical method is to combine numerical simulations with heuristic algorithms, such as the genetic algorithm (GA) [6–8], the simulated annealing algorithm (SA) [9], and particle swarm optimization (PSO) [10,11]. The limitation of these methods is that the computational complexity increases exponentially with the number of parameters [12]. Another approach is to adopt a neural network (NN) to help design the chiral structures, predicting the geometry parameters of the chiral metasurfaces directly from given optical spectra. Many research groups have designed chiral metasurfaces using an NN or a deep neural network (DNN). W. Ma et al. have proposed the first deep-learning-trained on-demand design model for chiral metamaterials [13]. E. Ashalley and coworkers have built a multitask deep learning model for the inverse design of 3D chiral metasurfaces [14]. Z. Tao and coworkers have adopted a DNN to directly predict the CD spectrum of 2D chiral metamaterials [15,16], and S. Du et al. have further proposed a novel data-enhancement algorithm [17]. In our previous work, transfer learning [18] was utilized to accelerate the training process for chiral metasurface design by transferring the knowledge of the network for one handedness to the network for the opposite handedness. However, preparing training data for that method can be labor-intensive because it involves many numerical simulations as well as data labelling and pre-processing.

Reinforcement learning (RL) [19] is a popular artificial intelligence algorithm that does not require training data. RL differs from supervised and unsupervised learning in that it automatically learns a strategy from a series of states by interacting with a dynamic environment. The RL agent is able to determine the ideal behavior for a new condition so as to maximize the global benefit. The learning mechanism of RL is also closer to the human cognitive process. In addition, the deep reinforcement learning (DRL) algorithm [20], which combines a DNN with RL, enables RL to tackle more complicated tasks.

Over the last few years, DRL has been applied to the design of nanophotonic devices. I. Sajedian and coworkers [21] have implemented a deep Q-learning network (DQN) model to design an optical dielectric nanostructure for color generation. The DQN is a popular DRL model that has been used to design many devices, such as a moth-eye structure for broadband absorption [22], multilayer optical thin films [23], a three-layer metamaterial with polarization-independent perfect solar absorption [24], a high-transmission color filter based on dielectric nanostructures [25], and a one-dimensional metagrating deflector [26]. Besides, a DQN with a double network (DDQN) [27] has been adopted to increase the efficiency of metasurface hologram design [28]. To further improve the learning efficiency of the agent, an asynchronous DDQN algorithm has been used to design a multimode-interference-based power splitter [29]. T. Shah and coauthors [30] have compared the DDQN and the deep deterministic policy gradient (DDPG) algorithm for metamaterial design, showing that policy-based RL performs better than value-based RL. The multi-path deep Q-learning network (MP-DQN) has been developed to optimize multi-layer thin films with discrete and continuous parameters [31]. Note that the DQN can only change one single parameter per step, which leads to low search efficiency. In comparison, proximal policy optimization (PPO) [32], a robust RL algorithm, is particularly suitable for optimizing multiple parameters simultaneously in a single step, and it has been used in the design of multi-layer thin films [33]. We should mention that most of the aforementioned works use value-based DRL methods to optimize a given nanostructure, and the environment that the agent interacts with is based on time-consuming numerical simulation. The slow feedback of the environment hinders the learning efficiency of the agent, which limits the widespread adoption of DRL-based intelligent design of metasurfaces.

Here, we model the chiral metasurface design process as a Markov decision process (MDP) and propose a deep learning-based virtual environment proximal policy optimization (DE-PPO) method to efficiently design chiral plasmonic metasurfaces [34]. This enables us to quickly retrieve the structural parameters that yield a giant CD at an arbitrary target wavelength within the spectral window of study. Our method offers the following advantages. Firstly, a DNN-based virtual environment is proposed, which significantly speeds up the learning process of the agent and thus accelerates the design of photonic devices. Specifically, a handedness-identification forward-prediction DNN (HI-DNN) is trained, which outputs the CD spectrum given the size and the handedness of the chiral plasmonic metasurface as input. The pre-trained HI-DNN is used as part of the environment to replace the numerical simulation. Secondly, the action decided by the agent changes all the structural variables simultaneously. Unlike the DQN, which changes only a single target parameter per step, PPO changes all parameters within a single step, significantly reducing the size of the action vector and improving the exploration efficiency. Thirdly, the agent interacts with the environment to learn how to achieve flexible targets with only one training. The flexible targets in our paper refer to the ability of the learned policy to achieve any target in the given spectral window after a single training. Instead of searching for a group of parameters satisfying one constrained target, as heuristic algorithms do, DRL can intelligently learn the ability to design. In other words, given an environment, the DRL agent can learn like a human and master the ability to solve complex tasks by interacting with it. Fourthly, since the DNN may have prediction errors compared with the numerical simulation, the agent trained in the DNN environment can finely adjust its network weights by interacting with the numerical simulation environment in the final steps so as to make more precise decisions in the actual implementation. With the DNN-assisted environment, the PPO method can efficiently design chiral plasmonic metasurfaces for a given target (typically, in less than 10 seconds on an Intel Xeon W-2135 CPU at 3.70 GHz with 32 GB of installed memory). We envision that more versatile nanophotonic devices can be designed with considerable time efficiency via our proposed approach.

2. Structure and method

2.1 Chiral plasmonic metasurface

We demonstrate the function of DE-PPO by applying it to design the chiral metasurface described in [35]. The linear (nonlinear) chiroptical responses of the structure can be explained by the (extended) Born-Kuhn model, and the light-matter interaction of the archetypical Born-Kuhn-type chiral metasurfaces can be interpreted analytically, which benefits efficient structural design as well as practical biosensing applications. Figure 1(a) shows the unit cell of the periodic nanostructure, which is composed of corner-stacked orthogonal gold nanorods in $C_4$ symmetry, avoiding linear birefringence. The CD is defined as ${\textrm {CD}} = {T_{LCP}} - {T_{RCP}}$, where $T_{LCP}$ and ${T_{RCP}}$ represent the transmission spectra of left-handed circularly polarized (LCP) and right-handed circularly polarized (RCP) incident light, respectively.
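For concreteness, the following minimal sketch (ours, not the authors' code; the 750–1250 nm window and the array contents are assumptions for illustration) shows how the CD spectrum follows from the two simulated transmission spectra and how the CD at a target wavelength is read out.

```python
import numpy as np

# Assumed 500-point spectral grid; the window is illustrative.
wavelengths = np.linspace(750.0, 1250.0, 500)
T_lcp = np.random.rand(500)   # placeholder for the FDTD transmission under LCP
T_rcp = np.random.rand(500)   # placeholder for the FDTD transmission under RCP

cd = T_lcp - T_rcp            # CD = T_LCP - T_RCP, as defined above

idx = np.argmin(np.abs(wavelengths - 1035.0))   # nearest grid point to 1035 nm
print(f"CD at {wavelengths[idx]:.1f} nm: {cd[idx]:+.3f}")
```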


Fig. 1. Chiral metasurfaces and the deep neural network proposed in this paper. (a) Unit cell of the metasurfaces and circular dichroism (CD) definition. (b) The left-handed enantiomer and the right-handed enantiomer. (c) The architecture of the handedness identification DNN (HI-DNN) (consisting of four hidden layers). The left-handed and right-handed circularly polarized light are represented as LCP and RCP. Geometry parameters are also indicated in the following manner: $L$ for the length, $W$ for the width, $P$ for the period, $G$ for the gap, and $D$ for the distance. The identification of the metasurface handedness is represented as $I$, where 20 represents left handedness, and 40 represents right handedness.


The chiral metasurfaces include left-handed (LH) and right-handed (RH) enantiomers, and the optimization variables comprise five important geometric parameters. The width $W$ and length $L$ of the nanorods, the period $P$ of the structure, the distance $D$ between the bottom of the upper layer and the top of the lower layer, and the gap $G$ between adjacent nanorods along the period direction all play important roles in the position and value of the maximum CD. To simplify the dataset collection process, we set the height $H$ of the nanorods to a constant (40 nm) according to [34]. The range of each dimensional parameter is pre-defined in Table 1. We use a parameter-sweeping method to ensure, as far as possible, that the optimal solution lies inside the parameter space, and we further validate this through a genetic algorithm [36]. It is worth mentioning that $P$ is obtained indirectly via $P = 2(L + Q) + G$ ($Q$ is a variable that ranges from 20 to 80 nm) so that the period is constrained to a reasonable range (namely, always larger than the size of the unit cell). The 3D chiral plasmonic metasurface unit consists of eight nanorods covered by a dielectric medium. The refractive index of the dielectric material is 1.3, similar to the setting in our previous work [18,35], and the dielectric constant of the gold nanorods is taken from the measurements of Johnson and Christy [37]. The substrate is silica, and the surrounding medium is regarded as air. Different arrangements of the nanorods lead to left-handed (LH) and right-handed (RH) enantiomers, as illustrated in Fig. 1(b). The incident light impinges normally on the meta-structures. It is worth mentioning that rounded corners are neglected in the electromagnetic simulation to reduce the simulation time; they do not influence the investigation of the intelligent design.
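As a sketch of how the parameter space and the period constraint interact (our illustration; the 10-nm grid matches the agent's action step introduced in Section 2.3, and uniform random sampling is an assumption):

```python
import random

# Parameter ranges in nm, from Table 1 / Eq. (10).
RANGES = {"D": (20, 70), "L": (100, 230), "W": (30, 90),
          "G": (20, 80), "Q": (20, 80)}

def sample_structure(step=10):
    """Draw D, L, W, G, Q on a 10-nm grid and derive the period P indirectly."""
    s = {k: random.randrange(lo, hi + 1, step) for k, (lo, hi) in RANGES.items()}
    s["P"] = 2 * (s["L"] + s["Q"]) + s["G"]   # always larger than the unit cell
    return s

print(sample_structure())   # e.g. {'D': 40, 'L': 180, 'W': 60, 'G': 50, 'Q': 30, 'P': 470}
```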


Table 1. The range of each dimensional parameter

2.2 Forward prediction network

The HI-DNN is constructed to learn the relationship between the metasurface structures and the CD spectra. The structure of the DNN is designed according to [15], as shown in Fig. 1(c). This DNN includes one input layer corresponding to the structural dimensional parameters, one output layer corresponding to the five hundred discrete points of the CD spectrum, and four identical hidden layers with 512 neuron nodes each, followed by batch normalization and dropout operations. It is worth mentioning that this network can predict the CD spectra of both LH and RH chiral metasurfaces thanks to an identification node added to the input layer of the DNN to represent the chirality of the metasurfaces [the red frame in Fig. 1(c), denoted by $I$]. Instead of simply using 0 and 1 to represent the two kinds of handedness, we use 20 to represent LH and 40 to represent RH ($I = 20$ for LH, and $I = 40$ for RH) to avoid a large variance among the input values. It should be noted that $I$ could equally be any other pair of well-separated values within the parameter range, serving as flags for LH and RH. Because the output nodes include both negative and positive values, the activation function of the hidden layers is chosen as the Leaky Rectified Linear Unit (Leaky-ReLU) [38]. The loss function of the HI-DNN is the mean square error (MSE), as shown in Eq. (1).

$$\begin{array}{l} Los{s_{MSE}} = \frac{1}{{{N_b}}}\sum_{i = 1}^{{N_b}} {{{(CD_{pred}^{i} - CD_{true}^{i})}^{2}}} \end{array}$$
where $N_b$ is the size of a batch, $CD_{pred}^{i}$ is the CD spectrum predicted by HI-DNN, and $CD_{true}^{i}$ is the target that the DNN prediction needs to approach.
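A minimal PyTorch sketch of the HI-DNN described above may look as follows (our reconstruction, not the authors' code; the layer ordering and the dropout rate are assumptions):

```python
import torch
import torch.nn as nn

class HIDNN(nn.Module):
    """Forward-prediction network: (D, L, W, G, Q, I) -> 500-point CD spectrum."""
    def __init__(self, n_in=6, n_hidden=512, n_out=500, p_drop=0.1):
        super().__init__()
        layers, d = [], n_in
        for _ in range(4):                      # four identical hidden layers
            layers += [nn.Linear(d, n_hidden),
                       nn.BatchNorm1d(n_hidden),
                       nn.LeakyReLU(),
                       nn.Dropout(p_drop)]
            d = n_hidden
        layers.append(nn.Linear(d, n_out))      # linear output: CD can be negative
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = HIDNN()
# D, L, W, G, Q in nm and the handedness flag I (20 = LH); values are illustrative.
params = torch.tensor([[45.0, 160.0, 60.0, 50.0, 40.0, 20.0]])
model.eval()                                    # eval mode so BatchNorm accepts a batch of 1
with torch.no_grad():
    cd_pred = model(params)                     # shape (1, 500)
```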

The training datasets consist of two parts: one half is the dataset for LH, and the other half is for RH. The finite-difference time-domain (FDTD) algorithm is adopted to perform the numerical simulations for dataset generation. The total number of samples in the dataset is 45,472. The dataset consists of the dimensional parameters of the structure as input and the corresponding CD spectra as target labels. The labeled handedness is also regarded as part of the input data. We then split the dataset into training, validation, and testing sub-datasets. The HI-DNN learns the mapping between the dimensional parameters and the CD spectra from the training dataset; the performance of the model is tested online with the validation sub-dataset during training and offline with the testing sub-dataset afterwards. A batch normalization operation [39] follows the output of each hidden layer to mitigate the internal covariate shift caused by the different length scales of the dimensional parameters.
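The dataset assembly and split can be sketched as follows (the file names and the split ratios are our assumptions for illustration):

```python
import numpy as np

# X: rows of (D, L, W, G, Q, I) with I = 20 for LH and I = 40 for RH;
# Y: the corresponding 500-point CD spectra from FDTD. Hypothetical files.
X = np.load("geometry_params.npy")   # shape (45472, 6)
Y = np.load("cd_spectra.npy")        # shape (45472, 500)

rng = np.random.default_rng(0)
idx = rng.permutation(len(X))        # reshuffle LH and RH samples together
n_train = int(0.8 * len(X))          # assumed 80/10/10 split
n_val = int(0.9 * len(X))

X_train, Y_train = X[idx[:n_train]], Y[idx[:n_train]]
X_val, Y_val = X[idx[n_train:n_val]], Y[idx[n_train:n_val]]
X_test, Y_test = X[idx[n_val:]], Y[idx[n_val:]]
```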

2.3 DE-PPO algorithm

The decision process that DRL handles must satisfy the Markov decision process (MDP) framework. The agent determines the behavior it should execute with the knowledge gained from experience, and the next state is determined by both the current state and the action it takes. The MDP can be represented as a tuple $M = \langle S, A, {P_{s,a}}, R \rangle$, where $S$ is the collection of finite states, $A$ is the collection of finite actions, ${P_{s,a}}$ is the state transition model, and $R$ is the instant reward of the environment. We denote a specific state as $s \in S$ and an action as $a \in A$. The behavior of producing an action according to a state is described by the policy $a = \pi (s)$. The process of DRL is shown in Fig. 2(a).


Fig. 2. (a) Agent interacting with the environment in an MDP. (b) The environment based on HI-DNN.


The agent and the environment are the critical components of DRL. The agent observes the state of the environment and executes an action based on its policy to change the environment, while the environment feeds back a reward and provides a new state to the agent. The agent learns the policy by interacting with the environment, aiming to find an optimal policy ${\pi ^{*}}$ that maximizes the expected accumulated reward. The accumulated reward is defined as follows.

$$\begin{array}{l} U({s_0},{s_1},\ldots,{s_t}) = \sum_{t = 0}^{\infty} {{\gamma ^{t}}R({s_t},{a_t})} \end{array}$$
where $\gamma \in [0,1)$ is the discount factor, and ${a_t} = {\pi ^{*}}({s_t})$. Based on the accumulated reward, the state value function is defined as follows.
$$\begin{array}{l} {V^{\pi} }(s) = {E^{\pi} }[R({s_t},{a_t}) + \gamma {V^{\pi} }({s_{t + 1}})|s = {s_t}] \end{array}$$

The environment is crucial for DRL and significantly influences the performance of the trained agent. We could directly use the FDTD as part of the environment to respond to the agent's actions: the agent makes a move that changes the dimensional parameters of the structure, and then FDTD performs the numerical simulation from which the CD spectrum is obtained. The environment calculates the reward according to the CD spectrum and feeds a new observation state back to the agent. However, it is time-consuming for FDTD to simulate a structure (several minutes), which significantly reduces the learning speed of the agent. To improve the response speed of the environment, we use the HI-DNN to replace the FDTD, as shown in Fig. 2(b). The HI-DNN can predict the CD spectrum within a second, thus dramatically improving the learning efficiency of the agent. Besides, the HI-DNN takes the handedness of the chiral metasurface enantiomers into account, allowing a more powerful agent to be trained for complex tasks.
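A Gym-style sketch of this HI-DNN virtual environment is given below (our reconstruction; `reward_fn` stands in for the reward of Eq. (11), sketched later in this section):

```python
import numpy as np
import torch

class VirtualEnv:
    """HI-DNN surrogate environment (our reconstruction, not the authors' code)."""

    def __init__(self, model, target_idx, handedness_flag=20.0):
        self.model = model.eval()        # pre-trained HI-DNN replacing FDTD
        self.target_idx = target_idx     # spectral index of the target wavelength
        self.I = handedness_flag         # 20 = LH, 40 = RH

    def step(self, state, action):
        """state: (D, L, W, G, Q) in nm; action: five entries in {-1, +1}."""
        new_state = np.asarray(state, float) + 10.0 * np.asarray(action, float)
        x = torch.tensor(np.append(new_state, self.I), dtype=torch.float32)[None, :]
        with torch.no_grad():
            cd = self.model(x).squeeze(0).numpy()            # CD spectrum in < 1 s
        reward = reward_fn(cd, self.target_idx, new_state)   # Eq. (11), sketched below
        return new_state, reward
```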

To design the chiral plasmonic metasurfaces for flexible targets, DE-PPO is proposed here, as shown in Fig. 3, to dramatically accelerate the training process. PPO has an actor-critic structure: the actor model executes the action, and the critic outputs the state value. The policy of the PPO is represented as ${\pi _\theta }({a_t}|{s_t})$, and its weights are copied to the old policy ${\pi _{{\theta _{old}}}}({a_t}|{s_t})$ every $T$ time steps. The old policy ${\pi _{{\theta _{old}}}}({a_t}|{s_t})$ interacts with the environment to collect the experience $({s_t},{a_t},{r_{t + 1}},{s_{t + 1}})$. Here, a replay buffer is used to store the experiences, and the policy ${\pi _\theta }({a_t}|{s_t})$ samples a mini-batch of experiences to update the network weights $\theta$. The actor model updates the policy by maximizing the clipped loss function, as shown in Eq. (4).

$$\begin{array}{l} {L^{CLIP}}(\theta ) = {{\textrm{E}}_t}\left[ {\min \left( {\frac{{{\pi _\theta }({a_t}|{s_t})}}{{{\pi _{{\theta _{old}}}}({a_t}|{s_t})}}{A_t}({s_t},{a_t}),{\textrm{ }}clip\left( {\frac{{{\pi _\theta }({a_t}|{s_t})}}{{{\pi _{{\theta _{old}}}}({a_t}|{s_t})}},{\textrm{ 1 - }}\varepsilon {\textrm{, 1 + }}\varepsilon } \right){A_t}({s_t},{a_t})} \right)} \right] \end{array}$$
where ${\theta _{old}}$ is the vector of policy parameters before the update, and $\varepsilon = 0.1$ is a hyperparameter. The parameter $\theta$ is updated by gradient ascent as $\theta = \theta + \alpha \nabla {L^{CLIP}}(\theta )$ (equivalently, by gradient descent on $-{L^{CLIP}}$), where $\alpha = {10^{ - 4}}$ is the learning rate. ${A_t}({s_t},{a_t})$ is an estimator of the advantage function at time step $t$, as shown in Eq. (5).
$$\begin{array}{l} {A_t}({s_t},{a_t}) = {\delta _t} + (\gamma \lambda ){\delta _{t + 1}} + {(\gamma \lambda )^{2}}{\delta _{t + 2}} + \cdots + {(\gamma \lambda )^{T - t - 1}}{\delta _{T - 1}} \end{array}$$
where $\gamma$ and $\lambda$ are hyper parameters ($\gamma = 0.99$, $\lambda = 0.95$), and the ${\delta _t}$ is obtained by Eq. (6).
$$\begin{array}{l} {\delta _t} = {r_t} + \gamma {V_\mu }({s_{t + 1}}) - {V_\mu }({s_t}) \end{array}$$
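A minimal sketch of Eqs. (5)–(6) (our illustration): the one-step TD errors $\delta_t$ are accumulated backwards with the discount $\gamma\lambda$.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards[t] = r_t for t = 0..T-1; values[t] = V(s_t) for t = 0..T.
    Returns the truncated advantage estimates A_0..A_{T-1} of Eq. (5)."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]  # Eq. (6)
    adv, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running   # backward accumulation
        adv[t] = running
    return adv
```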


Fig. 3. DE-PPO architecture for the design of chiral plasmonic metasurfaces.


The objective (i.e., loss) function of the critic, ${L^{V}}$, is as follows,

$$\begin{array}{l} {L^{V}}(\mu ) = {{\textrm{E}}_t}\left[ {|\bar V_\mu ^{\textrm{target}}({s_t}) - {V_\mu }({s_t})|} \right] \end{array}$$
where the target value based on the temporal-difference error (TD error) is given by Eq. (8).
$$\begin{array}{l} \bar V_\mu ^{\textrm{target}}({s_t}) = {r_{t + 1}} + \gamma {V_\mu }({s_{t + 1}}) \end{array}$$

The parameters of ${V_\mu }$ are updated by a stochastic gradient descent (SGD) algorithm with the gradient $\nabla {L^{V}}$,

$$\begin{array}{l} \mu = \mu - \eta \nabla {L^{V}}(\mu ) \end{array}$$
where $\eta = {10^{ - 4}}$ is the learning rate for the critic model optimization.
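Putting Eqs. (4) and (7)–(9) together, one PPO update step can be sketched as follows (our reconstruction; the network definitions, the `log_prob` helper, and the optimizers are assumptions):

```python
import torch

def ppo_update(actor, critic, opt_actor, opt_critic, batch, eps=0.1, gamma=0.99):
    """One PPO update on a sampled mini-batch (hypothetical helper)."""
    states, actions, rewards, next_states, old_logp, adv = batch

    # Actor: maximize the clipped surrogate of Eq. (4) by minimizing its negative.
    logp = actor.log_prob(states, actions)        # log pi_theta(a_t|s_t), assumed method
    ratio = torch.exp(logp - old_logp)            # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    loss_actor = -torch.min(ratio * adv, clipped * adv).mean()
    opt_actor.zero_grad(); loss_actor.backward(); opt_actor.step()

    # Critic: drive V(s_t) toward the TD target of Eq. (8), with the loss of Eq. (7).
    with torch.no_grad():
        v_target = rewards + gamma * critic(next_states).squeeze(-1)
    loss_critic = (v_target - critic(states).squeeze(-1)).abs().mean()
    opt_critic.zero_grad(); loss_critic.backward(); opt_critic.step()
```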

The agent needs to learn a strategy that can promptly achieve flexible targets by a sequence of actions. Thus, the task changes between many targets during the training process, and the trained agent can finally handle flexible targets without further training. The design of the chiral plasmonic metasurfaces is modelled as Eq. (10).

$$\begin{array}{l} opt.\ \max \ {\cal P} = Random\{ abs(CD_{985\,\textrm{nm} \sim 1250\,\textrm{nm}}^{Random(LH,RH)})\} \\ s.t.\ {C_1}: D \in [20,\ 70],\ L \in [100,\ 230],\\ \quad\ \ W \in [30,\ 90],\ G \in [20,\ 80],\\ \quad\ \ Q \in [20,\ 80]\\ s.t.\ {C_2}: P = 2 \times (L + Q) + G \end{array}$$
${\cal P}$ is the optimization target of the DRL agent, which guarantees that the target spectral position for the maximum absolute CD value and the handedness of the structure are changed randomly. $C_1$ denotes the constraint on the range of the dimensional parameters, and $C_2$ indicates that the period of the metasurface is calculated indirectly to guarantee that it is larger than the size of the unit cell. The state of the environment is given as a vector ${s_t} = ({D_t}, {L_t}, {W_t}, {G_t}, {Q_t})$. There are two possible actions for each dimensional parameter, namely subtracting or adding 10 nm, and the total action is represented as a vector ${a_t} = (a_t^{1},a_t^{2},a_t^{3},a_t^{4},a_t^{5})$. The goal of the model is to find the structure that exhibits the largest absolute CD at whatever target wavelength we define. The reward is determined by the CD at the target position, as shown in Eq. (11).
$${r_t} = \left\{ {\begin{array}{ll} {\left| {CD[{\textrm{target}}]} \right| - 1} & {\textrm{if the parameters exceed the boundary}}\\ {\left| {CD[{\textrm{target}}]} \right| + 1} & {\textrm{if } \left| {CD[{\textrm{target}}]} \right| > {T_1}}\\ {\left| {CD[{\textrm{target}}]} \right| + 5} & {\textrm{if } \left| {CD[{\textrm{target}}]} \right| > {T_2}}\\ {\left| {CD[{\textrm{target}}]} \right| + 10} & {\textrm{if } \left| {CD[{\textrm{target}}]} \right| > {T_3}}\\ {\left| {CD[{\textrm{target}}]} \right|} & {\textrm{otherwise}} \end{array}} \right.$$

A larger CD at the target position leads to a larger reward. Three thresholds are predefined as ${T_1}=0.37$, ${T_2}=0.4$, and ${T_3}=0.5$; an additional reward is added if the CD at the target position exceeds the corresponding threshold. Besides, a penalty is given if the dimensional parameters exceed the boundary. The DE-PPO algorithm is described in detail in Algorithm 1.
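The reward of Eq. (11) can be sketched directly (our reading; checking the highest threshold first is our assumption, since $T_3 > T_2 > T_1$):

```python
# Parameter ranges in nm, from Table 1 / Eq. (10).
RANGES = {"D": (20, 70), "L": (100, 230), "W": (30, 90),
          "G": (20, 80), "Q": (20, 80)}

def reward_fn(cd_spectrum, target_idx, state, T1=0.37, T2=0.4, T3=0.5):
    """Reward of Eq. (11): |CD| at the target index, plus bonuses/penalty."""
    cd = abs(cd_spectrum[target_idx])
    out_of_bounds = any(not (lo <= v <= hi)
                        for v, (lo, hi) in zip(state, RANGES.values()))
    if out_of_bounds:
        return cd - 1.0            # punishment for leaving the parameter space
    if cd > T3:
        return cd + 10.0           # highest threshold first
    if cd > T2:
        return cd + 5.0
    if cd > T1:
        return cd + 1.0
    return cd
```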


Algorithm 1. DE-PPO architecture for efficient design of chiral plasmonic metasurfaces

3. Results

We verify our method on a personal computer with an Intel Xeon W-2135 CPU at 3.70 GHz and 32 GB of installed memory (RAM, random-access memory). We first train the DNN with the pre-collected datasets. The learning rate is set to 0.0001, and the optimizer is Adam [40]. The MSE losses on the training dataset and the validation dataset are shown in Fig. 4 (semi-log plot). The gradual reduction of the training loss indicates that the HI-DNN keeps learning the mapping between the structure and its CD spectrum over the iterations. The validation loss curve shows that the HI-DNN accurately learns the knowledge behind the dataset. The training loss finally drops below $6.3 \times {10^{ - 5}}$, and the validation loss finally reaches $7 \times {10^{ - 5}}$.


Fig. 4. The MSE loss of the HI-DNN in the training dataset and validation dataset.


The trained HI-DNN is tested offline using the structure data in the test dataset. Figures 5(a) and 5(b) show the CD spectra of the LH enantiomer metasurfaces with two different sets of given dimensional parameters, simulated by FDTD (red curves, ground truth) and retrieved by the DNN (blue dots), respectively. The inset tables give the corresponding dimensional parameters. The CD spectra retrieved by the HI-DNN agree well with the FDTD results. It should be noted that the very narrow spectral features near 800 nm [particularly in Fig. 5(b)] are caused by the Rayleigh anomaly [41]. In this case the DNN encounters relatively large errors because few data exhibiting similar features exist in the training dataset. As shown in Figs. 5(c) and 5(d), the HI-DNN also performs well for the RH enantiomer metasurfaces.


Fig. 5. Comparison of the CD spectra of the metasurfaces retrieved by HI-DNN and simulated by FDTD, respectively. (a-b) The CD spectra of the left-handed chiral metasurfaces. (c-d) The CD spectra of the right-handed chiral metasurfaces.


The PPO agent interacts with the DNN-based virtual environment and collects experience online. The accumulated experiences are stored in the replay buffer, and the agent uses them to train the actor and critic networks. The reward curve illustrates the performance of the DRL, as shown in Fig. 6. There are 2000 episodes of interaction with the environment. At the beginning, the reward is small because the actor network outputs random actions, but the reward grows with the number of training episodes (over the first 700 episodes). Although the reward fluctuates during training due to the inherent instability of DRL, we can save the model with the highest reward.


Fig. 6. Training reward curve of the DE-PPO algorithm.


After the agent has been trained, we set the target wavelength and handedness of the metasurface, and the agent quickly finds a structure that satisfies the target. Typically, it takes 100 steps within a time scale of seconds. As an example, here we show four different tasks, i.e., we set the target wavelengths to 1100 nm, 1035 nm, 1000 nm, and 930 nm and choose LH enantiomers. Note that achieving various target wavelengths is highly desirable for applications such as chiral sensing and spectroscopy. To underline the significance of our method, here we cite two chiral-sensing examples from the literature. Y. Zhao et al. have designed chiral metamaterials resonant at about 1000 nm to detect 1,2-propanediol and Concanavalin A [42]. S. Both et al. have designed $\Omega$ antennas resonant at 1429.2 meV (869 nm) to deepen the physical insight into the chiral sensing mechanism [43]. The results for the different target wavelengths are illustrated in Figs. 7(a)–7(d). The CD spectra and the transmission spectra under LCP and RCP illumination are obtained by simulating the retrieved structures with FDTD. In all four tasks, the metasurfaces achieve maximum absolute CD values of about 0.4 at the target wavelengths, close to the performance of the genetic-algorithm-based method [36] (0.49 for LH and 0.45 for RH).


Fig. 7. The exemplary performances of the DE-PPO algorithm. For target wavelength of (a) 1100 nm, (b) 1035 nm, (c) 1000 nm, and (d) 930 nm, the DE-PPO can make a reasonable decision without training again. Here the corresponding CD spectra and transmission spectra of the retrieved LH structures are obtained based on FDTD simulations.


The PPO agent finds the dimensional parameters through a sequence of actions. The state transition processes for the last two steps in Figs. 7(a)–7(d) are shown in Figs. 8(a)–8(d) respectively, and they obey the MDP rule. Thus, $s_t$ is obtained from $s_{t-1}$ after the action is executed.


Fig. 8. The state transition processes of DE-PPO when obtaining the results of Fig. 7, showing only one Markov state process.


The DNN may have prediction errors compared with the numerical simulation, which influences the performance of the trained agent. The PPO agent could also be trained directly by interacting with the FDTD-based environment, but this is time-consuming (far more than 48 hours). Besides, a PPO agent that interacts directly with the FDTD environment can only achieve a single target (namely, an acceptable CD value at one target wavelength) per training, owing to the training-time constraint and the inflexibility of FDTD. An appropriate way to guarantee comparable performance is to finely tune the model parameters of the agent in the FDTD environment after it has been trained in the DNN virtual environment. Figure 9 shows the dimensional structure parameters and the corresponding optical spectra after this operation. The results are better than those shown in Fig. 7, at the additional cost of an acceptable computational time (about 3–8 hours for a single target).


Fig. 9. The optimization results after the agent is finely tuned in the FDTD environment. For target wavelengths of (a) 1100 nm, (b) 1035 nm, (c) 1000 nm, and (d) 930 nm, the DE-PPO can make a reasonable decision after interacting with the FDTD. The corresponding CD spectra and transmission spectra of the retrieved LH structures are obtained based on FDTD simulations.


The RH chiral plasmonic metasurfaces can also be designed with the trained agent. We set the handedness to RH, and the flexible targets can again be achieved, as shown in Fig. 10. Here we again choose four tasks, setting the target wavelength to 1100 nm, 1035 nm, 1000 nm, and 930 nm, respectively. The agent was trained in the DNN-based virtual environment, and the designed structures exhibit satisfactory CD at the target wavelengths.


Fig. 10. The exemplary performances of the DE-PPO algorithm. For target wavelength of (a) 1100 nm, (b) 1035 nm, (c) 1000 nm, and (d) 930 nm, the DE-PPO can make a reasonable decision without training again. Here the corresponding CD spectra and transmission spectra of the retrieved RH structures are obtained based on FDTD simulations.


To reveal the underlying physics, we plot in Fig. 11 the normalized electric field and the current density of the RH chiral structure of Fig. 10(b). The monitoring position is in the middle of the nanorods along the Z-direction. Figures 11(a) and 11(b) show the electric field |E| of the structure at the top layer and the bottom layer, respectively, at 1035 nm under LCP excitation. In comparison, Figs. 11(c) and 11(d) show the electric field under RCP excitation. Note that the amplitude of the electric field under LCP excitation is smaller than that under RCP excitation. The distributions of the current density in Figs. 11(e) and 11(f) clearly indicate the antibonding and bonding modes of the chiral metasurface.


Fig. 11. The electric field |E| and the current density of the structure in Fig. 10(b). (a, b) The electric field of the unit cell at 1035 nm with LCP incident light, shown at (a) the top layer and (b) the bottom layer of the structure, respectively. (c, d) The electric field of the unit cell at 1035 nm with RCP incident light, shown at (c) the top layer and (d) the bottom layer of the structure, respectively. (e, f) The current density for incident light of (e) RCP at 1035 nm and (f) LCP at 1250 nm. All plots are taken in the middle of the nanorods along the Z-direction.


A DNN can replace numerical calculations such as FDTD in the optimization process. From a similar comparative perspective, Refs. [44,45] have emphasized the efficiency of an NN or DNN after training. Besides DRL algorithms, one can also combine a DNN with heuristic algorithms such as the genetic algorithm (GA) to accelerate the design [46]. In Table 2, we compare several algorithms, including PPO and GA with and without the DNN (after the DNN is well trained). Compared with GA, PPO is more intelligent, and it has unique advantages in achieving flexible targets and in data-driven decision making. Besides, it can take handedness into account when designing chiral metasurfaces. Both GA and PPO are time-consuming in the design or training process when combined with numerical simulation algorithms (for example, FDTD). The forward-prediction DNN can replace the FDTD: with the help of the DNN, GA completes the optimization tasks within 10 seconds, and the total number of iterations can exceed $10^{6}$, which guarantees sufficient exploration. However, GA and other heuristic algorithms cannot intelligently implement arbitrary targets, including variable handedness. DRL has great potential for sophisticated tasks, and its potential in the intelligent design of chiral metasurfaces can be exploited with the aid of the DNN-based virtual environment (the number of training iterations can far exceed $10^{6}$, while the training time stays under 30 minutes). Moreover, the performance of the DRL agent trained in the DNN-based virtual environment can be further improved in the FDTD environment (less than 8 hours for a single target) to generate more reliable results.


Table 2. The comparison among different design methods (after DNN is well trained)

4. Conclusion

In conclusion, we have proposed the DE-PPO algorithm to inversely design chiral plasmonic metasurfaces for flexible targets. The agent is trained to search for a group of dimensional parameters that satisfy the predefined target. A handedness-identification DNN (HI-DNN) is trained on a dataset that reshuffles the LH and RH datasets. By using the HI-DNN as part of the environment, the training time of the PPO agent is significantly reduced. Furthermore, this method can solve new tasks without retraining the agent, and a single trained agent serves both LH and RH metasurfaces. Since the PPO algorithm changes all the dimensional parameters simultaneously, it is more efficient than previous methods, and it essentially reduces the output dimension of the network. The trained agent can interact with either the DNN environment or the FDTD environment to achieve a changeable target. Besides, the PPO is able to tune the dimensional parameters continuously. This study provides an efficient tool for exploring the use of DRL in engineering photonic devices, and the proposed method can be extended to the design of other nanophotonic components.

Funding

National Natural Science Foundation of China (61905018); Beijing Nova Program of Science and Technology (Z191100001119110); Fundamental Research Funds for the Central Universities (ZDYY202102-1); Fund of State Key Laboratory of Information Photonics and Optical Communications (Beijing University of Posts and Telecommunications) of China (IPOC2021ZR02); BUPT Excellent Ph. D. Students Foundation (CX2022214).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data Availability

Data presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. H. Y. Aboul-Enein, “Chirality at the nanoscale,” Chromatographia 70(9-10), 1523 (2009). [CrossRef]  

2. T. Arvinte, T. T. T. Bui, A. A. Dahab, B. Demeule, A. F. Drake, D. Elhag, and P. King, “The multi-mode polarization modulation spectrometer: part 1: simultaneous detection of absorption, turbidity, and optical activity,” Anal. Biochem. 332(1), 46–57 (2004). [CrossRef]  

3. J. M. Slocik, A. O. Govorov, and R. R. Naik, “Plasmonic circular dichroism of peptide-functionalized gold nanoparticles,” Nano Lett. 11(2), 701–705 (2011). [CrossRef]  

4. B. M. Maoz, Y. Chaikin, A. B. Tesler, O. Bar Elli, Z. Y. Fan, A. O. Govorov, and G. Markovich, “Amplification of chiroptical activity of chiral biomolecules by surface plasmons,” Nano Lett. 13(3), 1203–1209 (2013). [CrossRef]  

5. A. O. Govorov, Z. Y. Fan, P. Hernandez, J. M. Slocik, and R. R. Naik, “Theory of circular dichroism of nanomaterials comprising chiral molecules and nanocrystals: Plasmon enhancement, dipole interactions, and dielectric effects,” Nano Lett. 10(4), 1374–1382 (2010). [CrossRef]  

6. C. Liu, S. A. Maier, and G. Li, “Genetic-algorithm-aided meta-atom multiplication for improved absorption and coloration in nanophotonics,” ACS Photonics 7(7), 1716–1722 (2020). [CrossRef]  

7. Z. Li, D. Rosenmann, D. A. Czaplewski, X. Yang, and J. Gao, “Strong circular dichroism in chiral plasmonic metasurfaces optimized by micro-genetic algorithm,” Opt. Express 27(20), 28313–28323 (2019). [CrossRef]  

8. C. Akturk, M. Karaaslan, E. Ozdemir, V. Ozkaner, F. Dincer, M. Bakir, and Z. Ozer, “Chiral metamaterial design using optimized pixelated inclusions with genetic algorithm,” Opt. Eng. 54(3), 035106 (2015). [CrossRef]  

9. Y. Xie, M. Liu, T. Feng, and Y. Xu, “Compact disordered magnetic resonators designed by simulated annealing algorithm,” Nanophotonics 9(11), 3629–3636 (2020). [CrossRef]  

10. R. Q. Yan, T. Wang, X. Jiang, Q. Zhong, X. Huang, L. Wang, and X. Yue, “Design of high-performance plasmonic nanosensors by particle swarm optimization algorithm combined with machine learning,” Nanotechnology 31(37), 375202 (2020). [CrossRef]  

11. J. C. C. Mak, C. Sideris, J. Jeong, A. Hajimiri, and J. K. S. Poon, “Binary particle swarm optimized 2 x 2 power splitters in a standard foundry silicon photonic platform,” Opt. Lett. 41(16), 3868–3871 (2016). [CrossRef]  

12. D. Melati, Y. Grinberg, M. K. Dezfouli, S. Janz, P. Cheben, J. H. Schmid, A. Sanchez-Postigo, and D. X. Xu, “Mapping the global design space of nanophotonic components using machine learning pattern recognition,” Nat. Commun. 10(1), 4775 (2019). [CrossRef]  

13. W. Ma, F. Cheng, and Y. Liu, “Deep-learning-enabled on-demand design of chiral metamaterials,” ACS Nano 12(6), 6326–6334 (2018). [CrossRef]  

14. E. Ashalley, K. Acheampong, L. V. Besteiro, P. Yu, A. Neogi, A. O. Govorov, and Z. M. Wang, “Multitask deep-learning-based design of chiral plasmonic metamaterials,” Photonics Res. 8(7), 1213 (2020). [CrossRef]  

15. Z. Tao, J. Zhang, J. You, H. Hao, H. Ouyang, Q. Yan, S. Du, Z. Zhao, Q. Yang, X. Zheng, and T. Jiang, “Exploiting deep learning network in optical chirality tuning and manipulation of diffractive chiral metamaterials,” Nanophotonics 9(9), 2945–2956 (2020). [CrossRef]  

16. Z. Tao, J. You, J. Zhang, X. Zheng, H. Liu, and T. Jiang, “Optical circular dichroism engineering in chiral metamaterials utilizing a deep learning network,” Opt. Lett. 45(6), 1403–1406 (2020). [CrossRef]  

17. S. Du, J. You, J. Zhang, Z. Tao, H. Hao, Y. Tang, X. Zheng, and T. Jiang, “Expedited circular dichroism prediction and engineering in two-dimensional diffractive chiral metamaterials leveraging a powerful model-agnostic data enhancement algorithm,” Nanophotonics 10(3), 1155–1168 (2021). [CrossRef]  

18. X. Liao, L. Gui, Z. Yu, T. Zhang, and K. Xu, “Deep learning for the design of 3d chiral plasmonic metasurfaces,” Opt. Mater. Express 12(2), 758–771 (2022). [CrossRef]  

19. B. Modi and H. B. Jethva, “Reinforcement learning with neural networks: A survey,” in Proceedings of First International Conference on Information and Communication Technology for Intelligent Systems: Vol. 1, (2016), pp. 467–475.

20. X. Liao, X. Hu, Z. Liu, S. Ma, L. Xu, X. Li, W. Wang, and F. M. Ghannouchi, “Distributed intelligence: A verification for multi-agent drl-based multibeam satellite resource allocation,” IEEE Commun. Lett. 24(12), 2785–2789 (2020). [CrossRef]  

21. I. Sajedian, T. Badloe, and J. Rho, “Optimization of colour generation from dielectric nanostructures using reinforcement learning,” Opt. Express 27(4), 5874–5883 (2019). [CrossRef]  

22. T. Badloe, I. Kim, and J. Rho, “Biomimetic ultra-broadband perfect absorbers optimised with reinforcement learning,” Phys. Chem. Chem. Phys. 22(4), 2337–2342 (2020). [CrossRef]  

23. A. Jiang, Y. Osamu, and L. Chen, “Multilayer optical thin film design with deep Q learning,” Sci. Rep. 10(1), 12780 (2020). [CrossRef]  

24. I. Sajedian, T. Badloe, H. Lee, and J. Rho, “Deep Q-network to produce polarization-independent perfect solar absorbers: a statistical report,” Nano Convergence 7(1), 26 (2020). [CrossRef]  

25. I. Sajedian, H. Lee, and J. Rho, “Design of high transmission color filters for solar cells directed by deep Q-learning,” Sol. Energy 195, 670–676 (2020). [CrossRef]  

26. D. Seo, D. W. Nam, J. Park, C. Y. Park, and M. S. Jang, “Structural optimization of a one-dimensional freeform metagrating deflector via deep reinforcement learning,” ACS Photonics 9(2), 452–458 (2022). [CrossRef]  

27. H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, (2016), pp. 2094–2100.

28. I. Sajedian, H. Lee, and J. Rho, “Double-deep Q-learning to increase the efficiency of metasurface holograms,” Sci. Rep. 9(1), 10899 (2019). [CrossRef]  

29. X. Xu, Y. Li, and W. Huang, “Inverse design of the mmi power splitter by asynchronous double deep Q-learning,” Opt. Express 29(22), 35951–35964 (2021). [CrossRef]  

30. T. Shah, L. Zhuo, P. Lai, A. De La Rosa-Moreno, F. Amirkulova, and P. Gerstoft, “Reinforcement learning applied to metamaterial design,” J. Acoust. Soc. Am. 150(1), 321–338 (2021). [CrossRef]  

31. H. Wankerl, M. L. Stern, A. Mahdavi, C. Eichler, and E. W. Lang, “Parameterized reinforcement learning for optical system optimization,” J. Phys. D: Appl. Phys. 54(30), 305104 (2021). [CrossRef]  

32. Y. Wang, H. He, and X. Tan, “Truly proximal policy optimization,” in 35th Uncertainty in Artificial Intelligence Conference, vol. 115 (2020), pp. 113–122.

33. H. Wang, Z. Zheng, C. Ji, and L. Jay Guo, “Automated multi-layer optical design via deep reinforcement learning,” Mach. Learn.: Sci. Technol. 2(2), 025013 (2021). [CrossRef]  

34. X. Yin, M. Schaferling, B. Metzger, and H. Giessen, “Interpreting chiral nanophotonic spectra: The plasmonic born-kuhn model,” Nano Lett. 13(12), 6238–6243 (2013). [CrossRef]  

35. L. Gui, M. Hentschel, J. Defrance, J. Krauth, T. Weiss, and H. Giessen, “Nonlinear born-kuhn analog for chiral plasmonics,” ACS Photonics 6(12), 3306–3314 (2019). [CrossRef]  

36. X. Liao, L. Gui, C. Wang, M. Feng, Z. Yu, T. Zhang, and K. Xu, “Efficient design of 3d chiral plasmonic metasurfaces assisted by intelligent algorithms,” in 2021 Photonics and Electromagnetics Research Symposium, (2021), pp. 779–787.

37. P. B. Johnson and R. W. Christy, “Optical constants of the noble metals,” Phys. Rev. B 6(12), 4370–4379 (1972). [CrossRef]  

38. B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv:1505.00853 (2015).

39. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv:1502.03167 (2015).

40. N. S. Keskar and R. Socher, “Improving generalization performance by switching from adam to sgd,” arXiv:1712.07628 (2017).

41. V. G. Kravets, F. Schedin, and A. N. Grigorenko, “Extremely narrow plasmon resonances based on diffraction coupling of localized plasmons in arrays of metallic nanoparticles,” Phys. Rev. Lett. 101(8), 087403 (2008). [CrossRef]  

42. Y. Zhao, A. N. Askarpour, L. Sun, J. Shi, X. Li, and A. Alu, “Chirality detection of enantiomers using twisted optical metamaterials,” Nat. Commun. 8(1), 14180 (2017). [CrossRef]  

43. S. Both, M. Schaferling, F. Sterl, E. A. Muljarov, H. Giessen, and T. Weiss, “Nanophotonic chiral sensing: How does it actually work?” ACS Nano 16(2), 2822–2832 (2022). [CrossRef]  

44. L. Jiang, X. Li, Q. Wu, L. Wang, and L. Gao, “Neural network enabled metasurface design for phase manipulation,” Opt. Express 29(2), 2521–2528 (2021). [CrossRef]  

45. L. Gao, X. Li, D. Liu, L. Wang, and Z. Yu, “A bidirectional deep neural network for accurate silicon color design,” Adv. Mater. 31(51), 1905467 (2019). [CrossRef]  

46. D. Xu, Y. Luo, J. Luo, M. Pu, Y. Zhang, Y. Ha, and X. Luo, “Efficient design of a dielectric metasurface with transfer learning and genetic algorithm,” Opt. Mater. Express 11(7), 1852 (2021). [CrossRef]  

