
Reinforcement learning based robust control algorithms for coherent pulse stacking

Open Access

Abstract

For the fast and robust control of the delay lines for coherent pulse stacking, we combined the stochastic parallel gradient descent with momentum (SPGDM) and the soft actor-critic (SAC) into a powerful algorithm, SAC-SPGDM. The simulation shows that the algorithm can find the optimal delay-line positions to ensure that the 128 pulses of a 7-stage pulse stacking system are coherently stacked within 25 steps.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Femtosecond fiber lasers are widely used in scientific research, medicine, and industry [1]. However, their pulse energy is still too low for many applications, such as laser accelerators [2], high harmonic generation [3], and large-scale material processing [4], because the pulse energy is limited by nonlinearity in fiber amplifiers.

Though tremendous efforts have been devoted to chirped pulse amplification to alleviate the nonlinearity of fiber amplifiers, the payoff of these amplification methods has begun to level off [5] because the pulse stretching ability is limited. Beam or pulse combination in the space domain and/or the time domain then becomes an attractive option [6]. In the space domain [7,8], beams are combined in arithmetic progression, while in the time domain pulses are stacked in geometric progression, which makes time-domain stacking more attractive [9-12].

Delay-line coherent pulse stacking (DL-CPS) [13] is one of the easiest ways to scale the pulse energy. DL-CPS directly and symmetrically stacks up the pulses using a chain of delay lines and phase pre-shaping. An N-stage DL-CPS system multiplies the pulse energy by $2^N$ by combining $2^N$ pulses, while only N delay lines need to be controlled and stabilized. On the other hand, it is not easy to find the matched delay-line lengths and to stabilize them so that the pulses are coherently stacked.

Controlling the N delay lines for pulse stacking can be considered a multi-parameter, single-target optimization problem. In this case, the stacked pulse is frequency-doubled (SHG) and detected by a photodiode. The control target is to maximize the SHG intensity by adjusting the delay-line lengths, since the SHG intensity is proportional to the squared intensity of the pulses at the fundamental wavelength.

The delay lines can be controlled by stochastic parallel gradient descent (SPGD) [14-19] or by single-detector electronic frequency tagging (LOCSET) [20]. However, the convergence speed of SPGD is low and inversely proportional to the number of delay lines, and its dithering and searching process introduces noise [21]. Because SPGD is essentially a convex optimization solver while coherent pulse stacking is a non-convex optimization problem, SPGD cannot guarantee a global optimum unless the initial positions of the delay lines are preset manually.

In recent years, machine learning algorithms have been broadly used in many areas of optics [22], including auto-tuning of mode-locked lasers [23-25], optical communications [26,27], optical microscopy [28], computational imaging [29,30], and phase detection (or estimation) [31-33].

Among the machine learning family, deep reinforcement learning has the potential to strongly impact control tasks in the optics community. Reinforcement learning (RL) learns the optimal action from observation signals through interaction with the environment [34]. Deep RL incorporates deep learning [35] into RL, allowing the controller to make decisions from unstructured input data without manual feature engineering. Deep RL-based algorithms are beginning to be applied to traditional control problems in optics that are conventionally controlled by PID or manually, such as white-light continuum generation [36], optical design [37], and pulse combination [38-40]. Compared with SPGD, deep RL makes greater use of the environment information and learns from experience, and therefore could speed up the control process.

Deep RL was first applied to delay-line beam combining in 2019 and obtained performance comparable to conventional SPGD with PID controllers [38]. Specifically, a Deep Q-Network (DQN [41], one of the RL algorithms) was trained by watching the peak power of the combined pulse to control a simple 1-stage pulse combining system. Similarly, Deep Deterministic Policy Gradient (DDPG [42], another RL strategy) was tested in a 4-channel combining environment, which required a long training time (typically several days).

In Ref. [38], the number of delay lines was only two. For a larger number of delay lines, more robust and faster algorithms are required; in our experiment, 7 or 8 delay lines need to be controlled. For this purpose, we made the following improvements to the SPGD-based and RL-based controllers:

  • 1. Introducing a “momentum” term [43] into SPGD to create the SPGDM algorithm. The momentum helps to speed up the control process and to reduce dithering noise.
  • 2. The pattern of the burst pulse train is monitored by the Soft Actor-Critic (SAC [44], a stable RL algorithm). The burst pulse train pattern gives more information about the stacked pulses, such as the contrast or the satellite pulses, than monitoring the peak power of the stacked pulse alone.
  • 3. By combining the above two, we propose a new algorithm, SAC-SPGDM. In this strategy, SAC dominates the feedback control process in the early steps, and SPGDM then joins and gradually takes over as its weight increases, speeding up training and control convergence.
In this article, we will show how the SAC-SPGDM works and how it is applied to DL-CPS.

2. Robust algorithms for coherent pulse stacking

2.1 Delay line coherent pulse stacking

In this work, SAC-SPGDM was designed for a 7-stage delay-line pulse stacking system. The configuration of the experiment [45] is shown in Fig. 1. The seed pulses come from a femtosecond Yb:fiber laser [46], which delivers a train of near-Fourier-transform-limited pulses (∼70 fs) with 300 mW average power at a 1 GHz repetition rate. The seed pulse train is truncated into bursts containing 128 pulses at a 200 kHz burst rate. The pulses are then split into two branches by a polarization beam splitter. Pulses in each branch undergo a programmed phase preset to ensure that the polarizations of consecutive combined pulses are perpendicular for the next stacking stage. The recombined pulse train is shifted by half a period so that the pulse repetition rate and the number of pulses double. The burst pulse train is amplified with polarization-maintaining fiber amplifiers, and the pulses are then sent to 7 stages of delay lines for stacking. Figure 2 shows the schematic of each delay-line stacking stage. The horizontally (H) polarized pulses are behind the vertically (V) polarized ones; therefore, a delay line formed by a piezoelectric mirror (PZM) is applied to the V pulses. The stacked pulse is frequency-doubled and detected by a photodetector. The intensity and the intra-burst pulse train are A/D converted and sent to the controller, which feeds back to the PZMs. The controller determines the delay-line length of each PZM to combine the pulses. When the 1st, 2nd, …, Nth delay-line lengths match $2^0, 2^1, \ldots, 2^{N-1}$ times the pulse period, the $2^N$ pulses are perfectly stacked and the peak power of the output pulse reaches the global maximum. This state is referred to as the optimal matched state in the following text.
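As a minimal sketch of this matching condition in Python (the function name and the use of the 1 GHz seed repetition rate as a default are our own illustration, not taken from the released Code 1):

```python
C = 299792458.0  # speed of light in vacuum, m/s

def matched_delay_lengths(n_stages, rep_rate_hz=1e9):
    """Target delay-line lengths d_1..d_N (in meters) at the optimal matched state.

    Following the matching condition above, the i-th stage delays the pulses by
    2^(i-1) pulse periods, i.e. d_i = 2^(i-1) * c / f_rep.
    """
    pulse_period = 1.0 / rep_rate_hz               # 1 ns for the 1 GHz seed laser
    return [2 ** i * C * pulse_period for i in range(n_stages)]

# A 7-stage system stacks 2^7 = 128 pulses; the longest delay line is roughly 19 m.
print(matched_delay_lengths(7))
```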

Fig. 1. Configuration for delay line coherent pulse stacking system. Only the 3-stage pulse combination was plotted for simplicity. The delay lines are not to scale. PBS: polarization beam splitter; PZM: piezoelectric transducer mounted mirror; AMP: fiber amplifier; WP: wave plate; PD: photo-detector; PC: computer and A/D converter.

Fig. 2. Schematic of the delay line stacking model.

The parameter ${d_i}$ is defined as the delay-line length of the vertically polarized (V) pulse with respect to the adjacent horizontally polarized (H) pulse in the i-th stacking stage. Then ${\tau _i} = {d_i}/c$ ($c$ is the speed of light in vacuum) is the corresponding time delay, by which the pulses are combined in the time domain. A piezoelectric mirror is used to stabilize each delay line, as shown in Fig. 2.

Figure 3 shows the SHG power of the combined pulse as a function of the time delays (${\tau _1}$, ${\tau _2}$) of a 2-stage pulse stacking system. The surface strongly resembles an autocorrelation trace because the combination is coherent. It is clearly a periodic, non-convex function, so SPGD is easily trapped in a saddle point (or local maximum).

Fig. 3. (a): Heatmap of the SHG power of the combined pulse. Vertical and horizontal axes represent the time delay (fs) in the 1st (${\tau _1}$) and 2nd stages (${\tau _2}$). Larger power was represented by a lighter color. (b): Function surface of SHG power w.r.t 1st stage time delay (${\tau _1}$) and 2nd stage time delay (${\tau _2}$).

In reality, the delay-line lengths (PZM displacements) fluctuate continuously due to environmental noise, which affects the peak power of the stacked pulse.

In the simulation, two kinds of noise were taken into consideration. The first is fast noise, which may come from mechanical vibrations; it can be modeled as a zero-mean Gaussian random variable ${\rm N}(0,\sigma )$, where ${\rm N}$ denotes the Gaussian distribution and $\sigma$ the variance of the distribution. The second is slow noise, which corresponds to temperature drift; because temperature changes slowly with time, this noise can be characterized as a time-dependent bias $\mu$ of the system. Incorporating these two kinds of noise, the overall noise $e$ follows the distribution:

$$e \sim {\rm N}(\mu ,\sigma ).$$

Let $P$ be the SHG power (or peak power) of the final combined pulse and ${\mathbf{d}} = [{d_1},{d_2}, \cdots ,{d_N}]$ the delay lines of an N-stage DL-CPS system; our objective is then to control the delay lines ${\mathbf d}$ to maximize the SHG power $P$ under the noise $e$:

$$\mathop {\max }\limits_{\mathbf d} P({\mathbf d};e) = \mathop {\max }\limits_{{d_1},{d_2}, \cdots ,{d_N}} P({d_1},{d_2}, \cdots ,{d_N};e).$$

In the following paragraphs, the noise $e$ is applied to the delay lines ${\mathbf d}$ by default unless explicitly specified otherwise. For notational simplicity, we will not write the noise $e$ explicitly, i.e., we will use $P({\mathbf d})$ to denote $P({\mathbf d};e)$.
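Below is a minimal Python sketch of how such a noisy objective could be modeled. The noise magnitudes and the toy power landscape are our own placeholders: they only mimic the noise model of Eq. (1) and the periodic, non-convex shape of Fig. 3, not the actual simulation released as Code 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_noise(tau, step, sigma=1e-16, drift_per_step=1e-17):
    """Eq. (1): fast zero-mean Gaussian noise plus a slow, time-dependent bias
    mu(t) mimicking temperature drift (magnitudes here are placeholders)."""
    mu = drift_per_step * step                        # slow drift grows with time
    return np.asarray(tau) + rng.normal(loc=mu, scale=sigma, size=len(tau))

def combined_power(tau, pulse_period=1e-9):
    """Toy stand-in for P(d; e): maximal when every stage delay matches its
    target of 2^(i-1) pulse periods; periodic and non-convex like Fig. 3."""
    targets = np.array([2 ** i * pulse_period for i in range(len(tau))])
    mismatch = (np.asarray(tau) - targets) / pulse_period
    return float(np.prod(np.cos(np.pi * mismatch) ** 2))
```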

The system works in two processes: system initialization and system stabilization. During initialization, the delay-line lengths are aligned from a random state to near the optimal matched state. Once the delay lines are set, they are subject to noise, which makes the delay-line lengths drift away from the optimal matched state with maximum peak power; stabilization counteracts this drift.

We adapt momentum into SPGD (SPGDM) and combine SPGDM with SAC to form a new algorithm.

2.2 Stochastic parallel gradient descent with momentum

In the original SPGD method, the gradient of the objective function is estimated by applying a random perturbation to the delay-line lengths, $\delta {\mathbf d} = [\delta {d_1},\delta {d_2}, \cdots ,\delta {d_N}]$. Assume that at time step t the current delay lines ${{\mathbf d}^t}$ are near an optimal matched state (the maximum point in Fig. 3(b)); then the next (updated) delay lines ${{\mathbf d}^{t + 1}}$, which are closer to the optimal matched state, are given by

$${{\mathbf d}^{t + 1}} = {{\mathbf d}^t} + \eta \cdot [P ({{\mathbf d}^t} + \delta {{\mathbf d}^t}) - P ({{\mathbf d}^t})] \cdot \delta {{\mathbf d}^t},$$
where $\eta$ is the learning rate of the SPGD update. $P ({{\mathbf d}^t})$ is the SHG power at time step t with delay lines ${{\mathbf d}^t}$, and $P ({{\mathbf d}^t} + \delta {{\mathbf d}^t})$ is the SHG power after the perturbation, with delay lines ${{\mathbf d}^t} + \delta {{\mathbf d}^t}$. Let $\varDelta$ denote the perturbation strength, $\varDelta = ||{\delta {\mathbf d}} ||$; the effective gradient-descent learning rate of SPGD then equals ${\eta _{\textrm{eff}}} = \eta \cdot {\varDelta ^2}$. If the effective learning rate is increased, SPGD converges faster, at the cost of a larger error (or noise) introduced by the perturbation and searching process [14].

The “momentum” is equivalent to exponential-moving-average (EMA) smoothing of the estimated gradient in the SPGD optimization. The behavior of the original SPGD near a maximum point is equivalent to a set of coupled, damped harmonic oscillators, and the momentum term can improve the speed of convergence by bringing them closer to critical damping. This also makes SPGD less sensitive to the effective learning rate, so it can converge faster without increasing the dithering error.

The SPGD-with-momentum (SPGDM) algorithm is shown in Algorithm 1. The core of the algorithm is calculating the momentum before updating the delay lines, and then updating the current delay lines using the momentum value (lines 5-7 in Algorithm 1) instead of the raw gradient estimate used by SPGD. Note that when the momentum factor $\beta = 0$, the SPGDM of Algorithm 1 degenerates to the original SPGD. We show the effectiveness of the momentum term with simulation experiments in section 2.3 of Supplement 1.

[Algorithm 1: stochastic parallel gradient descent with momentum (SPGDM)]
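The following Python sketch shows one SPGDM update in the spirit of Eq. (3) and Algorithm 1; the function signature and default parameter values are our own assumptions rather than the released implementation.

```python
import numpy as np

def spgdm_step(d, P, m=None, eta=0.5, delta=1e-3, beta=0.9, rng=None):
    """One SPGD-with-momentum update (sketch of Algorithm 1).

    d: current delay-line vector (NumPy array); P: callable returning the SHG power;
    eta: learning rate; delta: perturbation strength ||delta d||;
    beta: momentum factor (beta = 0 recovers the original SPGD);
    m: running momentum, i.e. the EMA of the stochastic gradient estimate.
    """
    rng = rng or np.random.default_rng()
    if m is None:
        m = np.zeros_like(d)
    dd = delta * rng.choice([-1.0, 1.0], size=len(d))    # random perturbation
    grad_est = (P(d + dd) - P(d)) * dd                   # SPGD gradient estimate
    m = beta * m + (1.0 - beta) * grad_est               # EMA smoothing (momentum)
    return d + eta * m, m                                # ascend toward higher power
```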

2.3 Soft actor-critic monitoring the pulse train

A basic RL agent interacts with its environment in discrete time steps. At each time step t, the agent receives the current state (observation) ${s^t}$ of the environment and chooses an action ${a^t}$, which affects the environment. The environment moves to the new state ${s^{t + 1}}$ and a reward ${r^t} = R ({s^t},{a^t},{s^{t + 1}})$ is fed back to the agent, where R is a reward function and ${r^t}$ is the reward value evaluated for the current state-action pair. The agent is trained with the experience $({s^t},{a^t},{s^{t + 1}},{r^t})$ to learn a policy $\pi$ that maximizes the expected cumulative reward. The policy $\pi$ gives the probability of each action under a given state of the environment, $\pi (a,s) = \textrm{Pr}({a^t} = a|{s^t} = s)$. In our DL-CPS system, the observation state ${s^t}$ is the input signal of the controller in Fig. 1, and the action ${a^t}$ is the feedback output (e.g., the update values of the delay lines or the driving voltages) of the controller that controls the delay lines.
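A schematic agent-environment loop for this setting could look as follows. This is a sketch that assumes a gym-style interface with hypothetical `env` and `agent` objects; the actual environment and SAC agent are provided in Code 1.

```python
def run_episode(env, agent, max_steps=200):
    """Generic RL interaction loop (schematic; env/agent are assumed objects)."""
    state = env.reset()                                  # s^t = [I_pulse(t); d^t]
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                        # a^t = d^{t+1} - d^t
        next_state, reward, done, _ = env.step(action)   # environment transition
        agent.store(state, action, reward, next_state)   # experience (s, a, r, s')
        agent.update()                                   # one off-policy SAC update
        state, total_reward = next_state, total_reward + reward
        if done:
            break
    return total_reward
```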

In this work, the Soft Actor-Critic (SAC) algorithm [44], a robust variant of the Actor-Critic RL method [47], was used as the base RL agent. SAC is a maximum-entropy deep reinforcement learning algorithm with a stochastic actor. The maximum-entropy formulation provides a substantial improvement in exploration and robustness: maximum-entropy policies are robust to model and estimation errors, and they improve exploration by acquiring diverse behaviors. In addition, thanks to the Actor-Critic framework, SAC attains a substantial improvement in sample efficiency and performance. Figure 4 shows the Actor-Critic framework: 1) the “Critic” estimates the state-action value ${Q _\mathrm{\pi }}$ to evaluate the policy $\pi$; 2) the “Actor” updates the policy distribution in the direction suggested by the Critic. (See a detailed description of SAC in section 1 of Supplement 1.)

Fig. 4. Actor-Critic Framework.

After training the SAC agent (see SAC training in Algorithm S1 of Supplement 1), we obtain the trained policy ${\hat{\pi }_\theta }$. The trained policy can then be used to control the DL-CPS system by selecting the optimal action:

$${a^t} = \arg \mathop {\max }\limits_a {\hat{\pi }_\theta }(a,{s^t}).$$
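For SAC's stochastic Gaussian policy, the arg max in Eq. (4) is commonly realized at evaluation time by taking the mean of the squashed Gaussian. The sketch below assumes a hypothetical `policy_net` that returns the mean and log standard deviation of the action distribution; this is a standard SAC layout, not necessarily how the released code is organized.

```python
import torch

@torch.no_grad()
def select_action(policy_net, state, deterministic=True):
    """Evaluate the trained policy pi_hat_theta (sketch of Eq. 4)."""
    mean, log_std = policy_net(torch.as_tensor(state, dtype=torch.float32))
    # During training SAC samples from the Gaussian; for evaluation the mean
    # (the most probable action) is used as the deterministic choice.
    action = mean if deterministic else mean + log_std.exp() * torch.randn_like(mean)
    return torch.tanh(action)            # squash to the bounded action range
```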

The specification of the observation state $s$ directly affects the performance of RL algorithms. In previous work [38], the state of the system was described by the last pulse-power observation and the observations from the PZM monitor port. However, the power of the final pulse alone cannot describe the state of the DL-CPS system: similar powers may correspond to many different delay-line settings, as shown in Fig. 3.

We describe the state $s$ as the combination of the observation of the time-domain pulse train ${I_{\textrm{pulse}}}(t)$ (e.g., the pulse train sampled by a photodetector followed by a data acquisition card) and the observations from the PZM monitor port ${{\mathbf d}^t}$ (e.g., the delay-line lengths or driving voltages of the PZMs):

$${s^t} = [{I_{\textrm{pulse}}}(t);{{\mathbf d}^t}].$$

We describe the action $a$ as the feedback output (e.g. update of the delay lines) of the delay line controller of the PZMs:

$${a^t} = {{\mathbf d}^{t + 1}} - {{\mathbf d}^t}, {{\mathbf d}^{t + 1}} = {{\mathbf d}^t} + {a^t}.$$

We define the reward function as:

$${r^t} = {\rm R} ({s^t}) ={-} {({\rm P}({s^t}) - {{\rm P}_{\max }})^2}/{{\rm P}_{\max }}^2,$$
where ${\rm P}({s^t})$ is the current peak power (or SHG power) of the combined pulse, and ${{\rm P}_{\max }}$ is the maximum peak power achieved at the optimal matched state. The maximum reward is 0 and is achieved when ${\rm P}({s^t}) = {{\rm P}_{\max }}$. As the controller improves, the cumulative reward approaches zero instead of growing indefinitely, which helps to speed up the convergence of the RL algorithm. We show the advantage of using the pulse-train observation over the final power alone (as used in previous works) in section 2.4 of Supplement 1.
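A minimal Python sketch of Eqs. (5) and (7) (the function names are ours):

```python
import numpy as np

def make_state(pulse_train, delays):
    """State of Eq. (5): the sampled intra-burst pulse train concatenated
    with the PZM monitor readings (delay-line lengths or driving voltages)."""
    return np.concatenate([np.ravel(pulse_train), np.ravel(delays)])

def reward(shg_power, p_max):
    """Reward of Eq. (7): zero at the optimal matched state, negative otherwise."""
    return -((shg_power - p_max) ** 2) / p_max ** 2
```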

2.4 SAC-SPGDM

To fully utilize the strengths of SPGDM and SAC, we combine them into a robust and fast control algorithm called SAC-SPGDM. The strategy is that in the early steps the SAC algorithm plays the major role in moving all delay lines closer to the optimal matched state; as the number of iterations increases beyond a certain point, SPGDM joins the process. The combination of the two algorithms can be seen in the following two aspects:

  • 1. Non-convex control and initial alignment by SAC. If the initial state of the system is far from the optimal matched state, SPGDM fails to converge. However, SAC uses its power of exploration to align the delay lines and move them to a better state near the optimal matched state.
  • 2. Speed-up by SPGDM. When the environment is very complicated, SAC (or any RL method) takes a very long time to train. However, SPGDM can provide additional information about the environment (an approximate gradient), at least near the matched state. Therefore, incorporating SPGDM during SAC training can speed up the training process.
Algorithm 2 shows the action selection of SAC-SPGDM. Given the current state ${s^t}$ (the combination of the current pulse train ${I_{\textrm{pulse}}}(t)$ and delay lines ${{\mathbf d}^t}$), two candidate actions ${a_{\textrm{SAC}}}^t$ and ${a_{\textrm{SPGDM}}}^t$ are calculated by SAC and SPGDM, respectively. The final combined action ${a^t}$ is then given by a weighted average of ${a_{\textrm{SAC}}}^t$ and ${a_{\textrm{SPGDM}}}^t$. The balancing weight ${\mu ^t}$ is given by a piecewise-linear function that serves as a warm-up schedule. At the beginning ($t = 1$), the initial state of the system is not good enough for SPGDM, so only the action from SAC is used, ${a^t} = {a_{\textrm{SAC}}}^t$. During the following warm-up steps ${T_{\textrm{warm}}}$, the weight of SPGDM increases. For $t > {T_{\textrm{warm}}}$, the weight of SPGDM stays constant at $\mu$, and ${a^t} = (1 - \mu ) \cdot {a_{\textrm{SAC}}}^t + \mu \cdot {a_{\textrm{SPGDM}}}^t$.

[Algorithm 2: action selection for SAC-SPGDM]
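A minimal sketch of this action mixing and warm-up schedule (variable names are ours; the exact schedule in Algorithm 2 may differ in detail):

```python
def combined_action(a_sac, a_spgdm, t, t_warm, mu_final):
    """Weighted action of Algorithm 2: mu^t ramps linearly from 0 to mu_final
    over the first T_warm steps, so SAC acts alone at t = 1 and SPGDM's weight
    then grows and finally saturates at mu_final."""
    mu_t = mu_final * min(max(t - 1, 0) / t_warm, 1.0)
    return (1.0 - mu_t) * a_sac + mu_t * a_spgdm
```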

3. Numerical experiments

3.1 Experiments on combining 128 pulses

The proposed algorithms were evaluated on the 7-stage DL-CPS shown in Fig. 1. The task of the algorithm is to bring each delay line to the optimal matched state as quickly as possible. For this purpose, the RL agent was trained in two designed situations: 1) starting from a “random” initial state far from the optimal matched state, which corresponds to a stacking system about to undergo initial alignment, normally done manually; 2) starting from a fairly “good” initial state near the optimal matched state, which corresponds to a stacking system that has drifted off the optimized condition due to noise. In our experiments, control convergence means that the combined peak power reaches 90% of the maximum power of the optimal matched state, and the convergence step is the number of control steps needed to reach that level.
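As a small illustration of this criterion (names are ours), the convergence step can be read off a recorded power trace as follows:

```python
def convergence_step(power_trace, p_max, threshold=0.9):
    """First control step at which the combined peak power reaches 90% of the
    optimal matched state's maximum, or None if it never does."""
    for step, power in enumerate(power_trace, start=1):
        if power >= threshold * p_max:
            return step
    return None
```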

3.1.1 Controlling DL-CPS from the random initial state

For the training of SAC-SPGDM and SAC, the cumulative reward per training episode is shown in Fig. 5(a). The training procedure of an RL agent is divided into episodes, each lasting $T = 200$ steps. As can be seen, SAC-SPGDM converges in training (∼50 episodes) about 50% faster than SAC (∼110 episodes), since the incorporated SPGDM provides additional guidance and information for SAC's exploration.

Fig. 5. (a) Training of the SAC-SPGDM and SAC: cumulative reward w.r.t training episodes. (b) Evaluation step of the SAC-SPGDM and SAC: Output power (a.u.) w.r.t controlling steps. In this scenario SPGDM (and SPGD) could not find the maximum peak power. Red curve: SAC-SPGDM; purple curve: SAC; blue: SPGDM.

After training, we evaluated SAC-SPGDM and SAC on controlling the DL-CPS from a random initial state. As shown in Fig. 5(b), the final power after 30 control steps is similar for the two algorithms, but the control convergence of SAC-SPGDM is faster than that of SAC.

Figure 6 shows the combined pulse train at different steps under SAC-SPGDM control. As can be seen from Fig. 6(a), because the delay lines are not matched initially, the intensities of the individual pulses are also random. SAC-SPGDM then takes 10 steps to move the delay lines to a nearly matched state, so that the combined pulse emerges (Fig. 6(b)). After 20 steps, SAC-SPGDM pushes the delay-line lengths to the matched values so that all the pulses are stacked into one (Fig. 6(c)). An animation of this process and a comparison with other controllers can be seen in Visualization 1.

Fig. 6. Combined output burst pulse trains under the control of SAC-SPGDM from the random initial state. (a): the initial state of pulses; (b) pulse state after 10-step control; (c) pulse state after 20-step control; See detailed figures of the combined pulses in section 2.1 of Supplement 1 and animation video in Visualization 1.

3.1.2 Controlling DL-CPS from the nearly matched initial state

We then conducted the experiment from a nearly matched, or slightly mismatched, “good” initial point, which can correspond to the situation after running SAC-SPGDM for the first few steps. Four algorithms, SAC-SPGDM, SAC, SPGDM, and SPGD, were run, and their control convergence during evaluation is shown in Fig. 7. As a training-free method, SPGDM achieved 90% of the maximum power within 17 steps, 40% faster than SPGD (∼28 steps), because the momentum improves the speed of convergence by steering the optimization toward the fast-converging direction. Furthermore, SAC-SPGDM still converged fastest and remained stable after reaching the optimal matched state, despite the impact of noise. SAC-SPGDM achieved 90% of the maximum power within 10 steps, 23% faster than SAC (∼13 steps) and 41% faster than SPGDM. The figures of the combined pulses are given in section 2.2 of Supplement 1, and an animation of the combining procedure in Visualization 2. A comparison of the SAC-SPGDM controller and free running on 7-stage coherent pulse stacking under noise is shown in Visualization 3.

Fig. 7. Evaluation step of the SAC-SPGDM, SAC, SPGDM, and SPGD. Red curve: SAC-SPGDM; purple curve: SAC; blue curve: SPGDM; green curve: SPGD.

3.2 Experiments on different stage DL-CPS

The number of training episodes for SAC and SAC-SPGDM versus the number of DL-CPS stages (i.e., the number of delay lines), from 3 to 8, is shown in Fig. 8. The number of episodes (training iterations) needed increases significantly with the number of stages, while the training of SAC-SPGDM converges much faster than that of SAC. Thus, for a more complex system, SAC-SPGDM should be used.

Fig. 8. Training convergence episode for SAC and SAC-SPGDM on different stage systems.

After training, the control convergence of SAC-SPGDM and SAC on systems with different numbers of stages was also investigated. Starting from a good initial alignment (near the matched state), the control-convergence speed after training versus the number of DL-CPS stages is shown in Fig. 9(a). For SAC and SAC-SPGDM, the number of control steps needed remains almost unchanged as the number of stages increases, while for SPGD and SPGDM it increases significantly. Figure 9(b) shows the control-convergence speed versus the number of stages when starting from a random initial state; SAC-SPGDM again converges faster (fewer steps). For combining 256 pulses in an 8-stage DL-CPS, SAC-SPGDM controls the delay lines to reach the maximum peak power in 14 steps when the initial state is nearly matched (Fig. 9(a)) and in 24 steps when the initial state is random (Fig. 9(b)).

Fig. 9. (a) Control convergence step of different algorithms starting from good initial alignment. (b) Control convergence step of different algorithms starting from the random initial state.

3.3 Experiments of transfer trained policy to different power levels

After training the SAC-SPGDM algorithm, the trained (or learned) policy can find the maximum combined power it can achieve. The trained policy does not rely on the total output power to ‘know’ it is at the global maximum; the more important factor is the input pulse observation (oscilloscope) signal. If all pulses are stacked into one, the trained policy ‘knows’ it is at the global maximum.

We now investigate a more complex situation: training the SAC-SPGDM algorithm on a DL-CPS system with a given power level, then transferring the trained policy to another system with a different source-laser power level.

As an example, we trained the SAC-SPGDM algorithm on the DL-CPS system ${\Sigma _{p1}}$ with a source laser of ${p_1} = 300\textrm{ mW}$ average power. After training, we transferred the trained policy to another system ${\Sigma _{p2}}$ with a source laser of ${p_2} = 500\textrm{ mW}$ average power. We evaluated the transfer performance with the following two transfer strategies:

  • (1) Transfer with direct observation (direct transfer). In this case, we directly apply the trained policy to the new system ${\Sigma _{p2}}$. After receiving pulse observation ${I_{\textrm{pulse,}{\Sigma _{p2}}}}(t)$ on the new system ${\Sigma _{p2}}$, we directly feed the pulse observation ${I_{\textrm{pulse,}{\Sigma _{p2}}}}(t)$ to the trained policy to control the new system ${\Sigma _{p2}}$.
  • (2) Transfer with scaled observation. In this case, we scale the pulse observation signal on the new system then feed the scaled observation to the trained policy to get action. After receiving pulse observation ${I_{\textrm{pulse,}{\Sigma _{p2}}}}(t)$ on the new system ${\Sigma _{p2}}$, we scale the pulse observations using the ratio of the old laser’s power ${p_1}$ to the new laser’s power ${p_2}$:
    $${I_{\textrm{pulse,scaled}}} = \frac{{{p_1}}}{{{p_2}}}{I_{\textrm{pulse,}{\Sigma _{p2}}}}(t) = \frac{{300\textrm{ mW}}}{{500\textrm{ mW}}}{I_{\textrm{pulse,}{\Sigma _{p2}}}}(t).$$
The scaled pulse observation ${I_{\textrm{pulse,scaled}}}$ is consistent with the power level of the old training system ${\Sigma _{p1}}$.
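The scaling of Eq. (8) amounts to a one-line preprocessing step, sketched here with our own function name and the paper's power levels as defaults:

```python
def scale_observation(pulse_obs, p_train=0.3, p_new=0.5):
    """Scale the new system's pulse observation by p1/p2 (Eq. 8) so that it
    matches the power level the policy was trained on (powers in watts)."""
    return (p_train / p_new) * pulse_obs
```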

As shown in Fig. 10, with the scaled-observation strategy (orange curve), the trained policy achieves performance on the different-power-level system ${\Sigma _{p2}}$ as good as the evaluation on the original training system ${\Sigma _{p1}}$ (blue curve). The performance of the direct transfer (green curve) is much lower than that of the transfer with scaled observation. Thus, it is possible to transfer a trained policy to systems with different power levels, but the pulse observation from the new system ${\Sigma _{p2}}$ must be scaled by the ratio ${p_1}/{p_2}$, where ${p_1}$ is the average source-laser power of the original training system ${\Sigma _{p1}}$ and ${p_2}$ that of the new system ${\Sigma _{p2}}$.

Fig. 10. Train SAC-SPGDM on the system ${\Sigma _{p1}}$ with laser power ${p_1} = 300\textrm{ mW}$, then transfer the trained policy to the new system ${\Sigma _{p2}}$ with laser power ${p_2} = 500\textrm{ mW}$. Blue curve: performance of the trained policy on the original system ${\Sigma _{p1}}$. Orange curve: transfer performance of the trained policy to the new system ${\Sigma _{p2}}$ with known scaled observation. Green curve: transfer performance of the trained policy to the new system ${\Sigma _{p2}}$ directly.

3.4 Experiments of the impact of different feedback delay

Feedback loops in control systems are always associated with time delays due to the finite speed of sensing, signal processing, computation of the control input, and actuation. Let the state of a DL-CPS system at time step t be ${s^t}$ (the combination of the pulse train ${I_{\textrm{pulse}}}(t)$ and the delay lines ${{\mathbf d}^t}$). Suppose that photo-detection, data acquisition, analog-to-digital conversion, and action computation by the SAC-SPGDM algorithm together take a time $\Gamma$; there is then a feedback delay $\Gamma$ in our DL-CPS control system. By the time the SAC-SPGDM controller applies its action, the system state has shifted to ${s^{t + \Gamma }}$ (the combination of the pulse train ${I_{\textrm{pulse}}}(t + \Gamma )$ and the delay lines ${{\mathbf d}^{t + \Gamma }}$).

In our simulation, we implemented the feedback delay by applying free-running noise (e.g., environmental vibrations) to the delay lines during the time needed for photo-detection, data preprocessing, and action (update value) computation.
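A minimal sketch of this emulation (the noise magnitude and the square-root scaling of the free-running drift with latency are our own illustrative assumptions, not taken from Code 1):

```python
import numpy as np

def step_with_feedback_delay(delays, action, gamma_ms, noise_per_ms=1e-10, rng=None):
    """While the controller senses and computes for gamma_ms of latency, the
    delay lines keep drifting under free-running noise; only afterwards is the
    computed action applied to the PZMs."""
    rng = rng or np.random.default_rng()
    drift = rng.normal(scale=noise_per_ms * np.sqrt(gamma_ms), size=np.shape(delays))
    return np.asarray(delays) + drift + action
```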

Our algorithms can be trained on a real laser system with its real feedback delay. We now investigate a more complex situation: training the SAC-SPGDM algorithm on a low-feedback-delay system and then transferring the trained policy to a high-feedback-delay system.

We trained the SAC-SPGDM algorithm on a low-delay system ($\Gamma = 1\textrm{ ms}$) and then tested the trained policy on (1) the same low delay ($\Gamma = 1\textrm{ ms}$), (2) a medium delay ($\Gamma = 10\textrm{ ms}$), and (3) a high delay ($\Gamma = 100\textrm{ ms}$). Figure 11 shows the evaluation results of transferring the trained policy to the low-feedback-delay system (same as the training system), the medium-feedback-delay system (10 times higher delay than the training system), and the high-feedback-delay system (100 times higher delay than the training system). For all three feedback delays, the trained SAC-SPGDM algorithm converges within 25 steps and then keeps the combined pulse power above 0.9 (a.u.). A higher feedback delay causes a larger jitter of the combined pulse power, because a higher delay makes the system more susceptible to random noise and temperature drift.

Fig. 11. After trained on a low feedback delay system ($\Gamma = 1\textrm{ ms}$), evaluation step of the trained model on the low feedback delay ($\Gamma = 1\textrm{ ms}$), medium feedback delay ($\Gamma = 10\textrm{ ms}$), and high feedback delay ($\Gamma = 100\textrm{ ms}$) systems: output power (a.u.) w.r.t controlling steps.

One important conclusion from Fig. 11 is that a trained policy can be transferred from a low-feedback-delay system to a higher-feedback-delay system. Thus, we can explore fast and robust control algorithms in the simulation environment (lower feedback delay) and then deploy the trained policy to real-world physical systems (higher feedback delay).

3.5 Connections with the real-world experiment

In previous works, RL algorithms (DQN and DDPG) took 4 hours to train a simple 1-stage CPS system and 1-2 days to train a 4-stage CPS system [38]. One reason for the slow training is that deploying an RL algorithm in an optical system requires converting the optical signal to an analog electrical signal with a photo-detector and then converting the analog signal to a digital signal with an analog-to-digital converter; these two conversions add processing time at every step.

Our proposed algorithms can also be trained on real laser systems, but training there may take more time than in the simulation environment. For example, as shown in Fig. 5(a), the SAC-SPGDM algorithm takes about 50 episodes × 200 steps/episode = 10,000 steps to converge on a 7-stage DL-CPS environment. If the real-world processing time of the CPS system is 0.1 s per interaction step, the SAC-SPGDM algorithm would take about 3 hours to train. The longer training time makes temperature drift and noise more severe and also incurs additional energy costs in the real-world optical system. Thus fast-training and noise-robust control algorithms are critical for DL-CPS systems, which is our main motivation for studying effective learning-based algorithms in simulation environments. Our approach is to explore fast and robust control algorithms in the simulation environment and then deploy them to the real-world control system.

The source-code implementation of the DL-CPS simulation environment and the proposed control algorithms is provided in Code 1 (Ref. [48]).

4. Conclusions

We proposed a fast and robust control algorithm for a coherent pulse stacking system. First, we modified the SPGD algorithm by incorporating momentum to create SPGDM; SPGDM converges faster than the original SPGD when the initial state of the system is near the optimal matched state after proper initial alignment. We then combined SAC with SPGDM. The simulated experiments demonstrated that the SAC-SPGDM algorithm can bring the system from an initially random state to the optimal matched state and achieve the maximum peak power of the combined pulses. Furthermore, the SAC-SPGDM algorithm has the potential to be extended to more stages or to beam combination.

Funding

National Natural Science Foundation of China (61735001, 61761136002).

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are available in Ref. [48].

Supplemental document

See Supplement 1 for supporting content.

References

1. M. E. Fermann and I. Hartl, “Ultrafast fibre lasers,” Nat. Photonics 7(11), 868–874 (2013). [CrossRef]  

2. G. Mourou, B. Brocklesby, T. Tajima, and J. Limpert, “The future is fibre accelerators,” Nat. Photonics 7(4), 258–261 (2013). [CrossRef]  

3. S. Hädrich, J. Rothhardt, M. Krebs, F. Tavella, A. Willner, J. Limpert, and A. Tünnermann, “High harmonic generation by novel fiber amplifier based sources,” Opt. Express 18(19), 20242–20250 (2010). [CrossRef]  

4. H. Kalaycıoğlu, P. Elahi, Ö. Akçaalan, and F. Ö. Ilday, “High-Repetition-Rate Ultrafast Fiber Lasers for Material Processing,” IEEE J. Sel. Top. Quantum Electronics 24(3), 1–12 (2018). [CrossRef]  

5. T. Eidam, J. Rothhardt, F. Stutzki, F. Jansen, S. Hädrich, H. Carstens, C. Jauregui, J. Limpert, and A. Tünnermann, “Fiber chirped-pulse amplification system emitting 3.8 GW peak power,” Opt. Express 19(1), 255–260 (2011). [CrossRef]  

6. D. J. Richardson, J. Nilsson, and W. A. Clarkson, “High power fiber lasers: current status and future perspectives,” J. Opt. Soc. Am. B 27(11), B63–B92 (2010). [CrossRef]  

7. A. Klenke, S. Breitkopf, M. Kienel, T. Gottschall, T. Eidam, S. Hädrich, J. Rothhardt, J. Limpert, and A. Tünnermann, “530 W, 1.3 mJ, four-channel coherently combined femtosecond fiber chirped-pulse amplification system,” Opt. Lett. 38(13), 2283–2285 (2013). [CrossRef]  

8. M. Kienel, M. Müller, A. Klenke, J. Limpert, and A. Tünnermann, “12 mJ kW-class ultrafast fiber laser system using multidimensional coherent pulse addition,” Opt. Lett. 41(14), 3343–3346 (2016). [CrossRef]  

9. M. Kienel, M. Müller, A. Klenke, T. Eidam, J. Limpert, and A. Tünnermann, “Multidimensional coherent pulse addition of ultrashort laser pulses,” Opt. Lett. 40(4), 522–525 (2015). [CrossRef]  

10. H. Stark, M. Müller, M. Kienel, A. Klenke, J. Limpert, and A. Tünnermann, “Electro-optically controlled divided-pulse amplification,” Opt. Express 25(12), 13494–13503 (2017). [CrossRef]  

11. T. Zhou, J. Ruppe, C. Zhu, I. Hu, J. Nees, and A. Galvanauskas, “Coherent pulse stacking amplification using low-finesse Gires-Tournois interferometers,” Opt. Express 23(6), 7442–7462 (2015). [CrossRef]  

12. I. Astrauskas, E. Kaksis, T. Flöry, G. Andriukaitis, A. Pugžlys, A. Baltuška, J. Ruppe, S. Chen, A. Galvanauskas, and T. Balčiūnas, “High-energy pulse stacking via regenerative pulse-burst amplification,” Opt. Lett. 42(11), 2201–2204 (2017). [CrossRef]  

13. H. Tünnermann and A. Shirakawa, “Delay line coherent pulse stacking,” Opt. Lett. 42(23), 4829–4832 (2017). [CrossRef]  

14. G. Cauwenberghs, “A fast stochastic error-descent algorithm for supervised learning and optimization,” Advances in Neural Information Processing Systems 5, 244–251 (1993).

15. P. Zhou, Z. Liu, X. Wang, Y. Ma, H. Ma, X. Xu, and S. Guo, “Coherent beam combining of fiber amplifiers using stochastic parallel gradient descent algorithm and its application,” IEEE J. Sel. Top. Quantum Electronics 15(2), 248–256 (2009). [CrossRef]  

16. H. Chang, J. Xi, R. Su, P. Ma, Y. Ma, and P. Zhou, “Efficient phase-locking of 60 fiber lasers by stochastic parallel gradient descent algorithm,” Chin. Opt. Lett. 18(10), 101403 (2020). [CrossRef]  

17. H. Chang, Q. Chang, J. Xi, T. Hou, R. Su, P. Ma, J. Wu, C. Li, M. Jiang, Y. Ma, and P. Zhou, “First experimental demonstration of coherent beam combining of more than 100 beams,” Photon. Res. 8(12), 1943–1948 (2020). [CrossRef]  

18. S. B. Weiss, M. E. Weber, and G. D. Goodno, “Group delay locking of coherently combined broadband lasers,” Opt. Lett. 37(4), 455–457 (2012). [CrossRef]  

19. A. Abuduweili, B. Yang, and Z. Zhang, “Modified stochastic gradient algorithms for controlling coherent pulse stacking,” in Conference on Lasers and Electro-Optics (CLEO), STh4P.1 (2020)

20. T. M. Shay, V. Benham, J. T. Baker, A. D. Sanchez, D. Pilkington, and C. A. Lu, “Self-synchronous and self-referenced coherent beam combination for large optical arrays,” IEEE J. Sel. Top. Quantum Electron. 13(3), 480–486 (2007). [CrossRef]  

21. Q. Du, T. Zhou, L. R. Doolittle, G. Huang, D. Li, and R. Wilcox, “Deterministic stabilization of eight-way 2d diffractive beam combining using pattern recognition,” Opt. Lett. 44(18), 4554–4557 (2019). [CrossRef]  

22. G. Genty, L. Salmela, J. M. Dudley, D. Brunner, A. Kokhanovskiy, S. Kobtsev, and S. K. Turitsyn, “Machine learning and applications in ultrafast photonics,” Nat. Photonics 15(2), 91–101 (2021). [CrossRef]  

23. T. Baumeister, S. L. Brunton, and J. N. Kutz, “Deep learning and model predictive control for self-tuning mode-locked lasers,” J. Opt. Soc. Am. B 35(3), 617–626 (2018). [CrossRef]  

24. G. Pu, L. Yi, L. Zhang, and W. Hu, “Intelligent programmable mode-locked fiber laser with a human-like algorithm,” Optica 6(3), 362–369 (2019). [CrossRef]  

25. G. Pu, L. Yi, L. Zhang, and W. Hu, “Genetic algorithm-based fast real-time automatic mode-locked fiber laser,” IEEE Photonics Technol. Lett. 32(1), 7–10 (2020). [CrossRef]  

26. Z. Yang, J. Ke, W. Hu, and L. Yi, “Effect of ADC parameters on neural network based chaotic optical communication,” Opt. Lett. 46(1), 90–93 (2021). [CrossRef]  

27. Q. Zhuge, X. Zeng, H. Lun, M. Cai, X. Liu, L. Yi, and W. Hu, “Application of Machine Learning in Fiber Nonlinearity Modeling and Monitoring for Elastic Optical Networks,” J. Lightwave Technol. 37(13), 3055–3063 (2019). [CrossRef]  

28. Y. Rivenson, Z. Göröcs, H. Günaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica 4(11), 1437–1443 (2017). [CrossRef]  

29. G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica 6(8), 921–943 (2019). [CrossRef]  

30. Y. Wu, Y. Rivenson, Y. Zhang, Z. Wei, H. Günaydin, X. Lin, and A. Ozcan, “Extended depth-of-field in holographic imaging using deep-learning-based autofocusing and phase recovery,” Optica 5(6), 704–710 (2018). [CrossRef]  

31. J. White and Z. Chang, “Attosecond streaking phase retrieval with neural network,” Opt. Express 27(4), 4799–4807 (2019). [CrossRef]  

32. T. Hou, Y. An, Q. Chang, P. Ma, J. Li, D. Zhi, L. Huang, R. Su, J. Wu, Y. Ma, and P. Zhou, “Deep-learning-based phase control method for tiled aperture coherent beam combining systems,” High Power Laser Sci. Eng. 7, e59 (2019). [CrossRef]  

33. D. Wang, Q. Du, T. Zhou, D. Li, and R. Wilcox, “Stabilization of the 81-channel coherent beam combination using machine learning,” Opt. Express 29(4), 5694–5709 (2021). [CrossRef]  

34. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, 2018).

35. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

36. C. M. Valensise, A. Giuseppi, G. Cerullo, and D. Polli, “Deep reinforcement learning control of white-light continuum generation,” Optica 8(2), 239–242 (2021). [CrossRef]  

37. T. Yang, D. Cheng, and Y. Wang, “Designing freeform imaging systems based on reinforcement learning,” Opt. Express 28(20), 30309–30323 (2020). [CrossRef]  

38. H. Tünnermann and A. Shirakawa, “Deep reinforcement learning for coherent beam combining applications,” Opt. Express 27(17), 24223–24230 (2019). [CrossRef]  

39. H. Tünnermann and A. Shirakawa, “Deep reinforcement learning for tiled aperture beam combining in a simulated environment,” J. Phys. Photonics 3(1), 015004 (2021). [CrossRef]  

40. A. Abuduweili, B. Yang, and Z. Zhang, “Control of delay lines with reinforcement learning for coherent pulse stacking,” in Conference on Lasers and Electro-Optics (CLEO), JW2F.33 (2020)

41. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602 (2013)

42. T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” In International Conference on Learning Representation (ICLR), (2016).

43. N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Networks 12(1), 145–151 (1999). [CrossRef]  

44. T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” In International Conference on Machine Learning (PMLR), 1861–1870 (2018).

45. B. Yang, G. Liu, A. Abulikemu, Y. Wang, A. Wang, and Z. Zhang, “Coherent stacking of 128 pulses from a GHz repetition rate femtosecond Yb:fiber laser,” In Conference on Lasers and Electro-Optics (CLEO), JW2F.28 (2020)

46. C. Li, Y. Ma, X. Gao, F. Niu, T. Jiang, A. Wang, and Z. Zhang, “1 GHz repetition rate femtosecond Yb:fiber laser for direct generation of carrier-envelope offset frequency,” Appl. Opt. 54(28), 8350 (2015). [CrossRef]  

47. R. S. Sutton, D. McAllester, S Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances in neural information processing systems, 99, 1057–1063 (1999).

48. A. Abuduweili, “Learning-based robust control algorithms for coherent pulse stacking,” GitHub Repository (2021), https://github.com/Walleclipse/Reinforcement-Learning-Pulse-Stacking.

Supplementary Material (5)

Code 1       Source code of the coherent pulse stacking simulation environment and the SAC-SPGDM controlling algorithm.
Supplement 1       A detailed description of the SAC training and additional experiment.
Visualization 1       Comparison of SAC-SPGDM controller and free running on 7-stage coherent pulse stacking system from the nearly matched initial state
Visualization 2       Controlling coherent pulse stacking from the random initial state
Visualization 3       Controlling coherent pulse stacking from the nearly matched initial state
