
StarLight: a photonic neural network accelerator featuring a hybrid mode-wavelength division multiplexing and photonic nonvolatile memory

Open Access

Abstract

This paper proposes StarLight, a low-power, high-inference-throughput photonic artificial neural network (ANN) accelerator featuring photonic ‘in-memory’ computing and hybrid mode-wavelength division multiplexing (MDM-WDM). Specifically, StarLight uses nanophotonic non-volatile memory and passive microring resonators (MRs) to form a photonic dot-product engine, achieving optical ‘in-memory’ multiplication with near-zero power consumption during the inference phase. Furthermore, we design an on-chip hybrid wavelength-mode multiplexing module and scheme to increase computational parallelism. As a proof of concept, a 4×4×4 optical computing unit featuring four wavelengths and four modes is simulated at 10 Gbps, 15 Gbps, and 20 Gbps data rates. We also simulated Iris dataset classification and achieved an inference accuracy of 96%, entirely consistent with the classification accuracy on a 64-bit computer. Therefore, StarLight holds promise for realizing low-energy-consumption hardware accelerators to address the incoming challenges of data-intensive artificial intelligence (AI) applications.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Thanks to their high accuracy, artificial neural networks (ANNs) [1–3] have become the de facto solution to various artificial intelligence (AI) problems such as object recognition and speech processing. ANNs use layers of interconnected artificial neurons to perform complex mathematical operations. There are two phases in ANN deployment: training and inference. Specifically, the training phase learns the weight values from labelled inputs, while the inference phase uses the learned weights to classify or predict output values for unknown inputs [4]. The most computation-intensive fundamental operation in an ANN is matrix-vector multiplication. Once an ANN model is trained, it can infer unseen relationships for unlabeled inputs; that is, billions of inferences may be performed with a single trained model. For instance, one inference of AlexNet [2] requires 724M floating-point multiply-accumulate (MAC) operations. To reduce the overhead of transferring data and performing such large numbers of MAC operations, researchers have proposed excellent designs ranging from Von Neumann architectures [5–8] to in-memory computation [9–11]. However, under the constraints of electronic device properties and the slowdown of Moore's Law, electronic accelerators cannot further improve energy efficiency and computation frequency.

To break through the electronic bottleneck, the photonic ANN (PANN) accelerator has emerged as an outstanding candidate for providing high-performance, low-energy AI services. In 2017, Shen et al. used 56 programmable Mach-Zehnder interferometers (MZIs) in a triangular cascaded array structure to realize a 4$\times$4 photonic weight matrix. The experiments demonstrated that the proposed architecture could offer an enhancement in computational speed and power efficiency over state-of-the-art electronics for conventional inference tasks [12]. Several other MZI-based PANN accelerators have also been proposed [13–15]. Most of them use the singular value decomposition (SVD) principle to decompose an arbitrary matrix into two unitary matrices and a diagonal matrix, use MZI arrays to realize the optical unitary and diagonal matrices, and then cascade the unitary-diagonal-unitary matrices to realize MAC operations. However, the area of the MZI structure is large. In addition, the number of MZI units required to realize an $N\times N$ matrix is $O(N^2)$, which seriously limits scalability.

Microring resonators (MRs) are another type of photonic device that can be used to perform MAC operations in PANNs [16–23]. They can use their transmittance as weight values and, combined with wavelength division multiplexing (WDM) technology, achieve large-scale matrix calculations. Various MR-based PANN accelerators have been proposed [16–21]. In [16], Tait et al. proposed an MR-based ‘broadcast-and-weight’ architecture and demonstrated a 4$\times$4 weight bank based on 16 MRs, later reporting an acceleration factor of 294$\times$ over a conventional CPU in [17]. Similar to the ‘broadcast-and-weight’ architecture, by combining MR and WDM technologies, researchers can readily use the speed of light to conduct massively parallel MAC operations in ANNs, as in ConvLight [18], DEAP-CNN [20], and HolyLight [21]. These works demonstrate that PANNs can outperform electronic counterparts with promising improvements in energy efficiency and speed.

So far, prior PANN accelerators need to read the weights from external memory when performing multiplication and map each value to the bias voltage of the MR or MZI units to configure the weights. However, once the ANN weights are trained, they do not need to be updated frequently during the inference stage. Therefore, storing weight values in non-volatile phase-change materials (PCMs) and computing on them directly can decrease the data transfer overhead dramatically. In [24], the authors use Ge$_2$Sb$_2$Te$_5$ (GST, a kind of PCM) and directional couplers to form an on-chip passive photonic MAC unit capable of operating at trillions of MACs per second. In this structure, the authors distribute the input signal power equally by constructing directional couplers with different split ratios and then weight the signals with PCM cells in different states to achieve parallel computing. The scale of parallel computing is limited by the number of directional couplers that can be integrated. However, it is difficult to implement multiple directional couplers with precise and distinct splitting ratios in the same network, resulting in limited scalability. In [25], the authors use PCM and MRs to implement an emerging photonic in-memory computing neural network accelerator. Unlike the electro-optic (EO)-MR based PANN, the PCM cells are integrated into the MRs. The PCM in different states changes the effective refractive index of the MRs, enabling control over the MR transmittance. However, due to the narrow channel spacing and the limitations of the MR itself, this design has limited scalability and inference accuracy.

Moreover, massive parallelism is crucial for enabling high-performance ANN accelerators. However, current photonic ANN accelerators only apply multi-wavelength channels to realize parallel computing. Multiplexing in other dimensions, especially multi-dimensional hybrid multiplexing, is overlooked. Multimode communication in an optical waveguide offers an additional degree of freedom to scale communication bandwidth [26]. In addition, optical signals with different modes and wavelengths can be transmitted simultaneously in one waveguide by hybrid mode-wavelength division multiplexing (MDM-WDM) technology [26–30]. However, existing photonic ANN accelerators do not integrate hybrid MDM-WDM technology, thus missing opportunities to further increase inference throughput.

Therefore, nanophotonic accelerators are deemed to pave the way for performing ANNs with low latency and power consumption. However, existing designs face the following challenges: (1) Although PCM is used as a photonic non-volatile analog memory to perform in-memory computation, the scalability and accuracy of existing PCM-based designs are limited. (2) The parallelism of the PANN has not been fully exploited, since on-chip multidimensional multiplexing, such as hybrid MDM-WDM technology, has not been integrated into ANN computation. To solve the above problems, in this paper we propose a novel ‘in-memory’ computing and hybrid MDM-WDM PANN accelerator for the inference process, named StarLight. In summary, the main contributions of this paper are as follows:

  • (1) We propose, for the first time, a computation scheme that combines WDM and MDM in a PANN, realizing a high degree of computing parallelism in both the spatial and wavelength domains. In addition, compared with a WDM-only PANN, the WDM-MDM-based method can effectively reduce the number of wavelengths used and relax the requirements on the free spectral range (FSR) and Q-factor of the microring.
  • (2) We further integrate PCM GST (Ge$_2$Sb$_2$Te$_5$) cells and passive MRs into StarLight to achieve photonic in-memory processing, which reduces the energy consumption of the whole StarLight. Moreover, by decoupling the PCM from the MR, the crosstalk between adjacent MRs is decreased, resulting in high scalability and accuracy.
  • (3) We conduct a simulation of Iris dataset classification using a 4$\times$4$\times$4 PVMM and achieve an inference accuracy of 96%. Simulation results show that the proposed structure provides a computing density of several TMAC/s/mm$^2$.
The rest of the paper is organized as follows. Section 2 presents the proposed StarLight architecture, including the photonic dot-product engine and the hybrid MDM-WDM photonic adder. Section 3 gives the simulation results, including the transmission spectra and power loss of the MRs, GST, and mode converters, the computing speed and accuracy of the 4$\times$4$\times$4 PVMM, and the Iris dataset classification task. Finally, we conclude this work in Section 4.

2. StarLight design

As mentioned before, the most computation-intensive operation of ANN is the MAC operation which consists of multiplication and addition operations. Thus, increasing the processing efficiency of MAC will contribute greatly to ANN performance improvement. In this section, in order to implement in-memory MAC functions using photonic devices, we first propose a photonic engine based on GST and passive MRs (PMRs) to perform the multiplication operation. Then, we design a photonic adder using the multiplexing characteristics of wavelength and mode. Finally, we provide details of our novel PANN accelerator architecture, StarLight.

2.1 Photonic dot-product engine based on GST and MRs

Although previous MR-based PANN accelerators, such as the EO-MR based HolyLight-M [21] and the PCM-MR based accelerator [25], have shown their efficiency, these designs have certain drawbacks. The shortcomings of EO-MR-based designs are: (1) The weight of each MR is changed by adjusting the bias voltage, without any storage capability. (2) An EO-MR requires a steady bias voltage to hold its state, which results in static energy consumption. (3) Each active MR needs two electrical pads, with the footprint of each pad being at least 150 $\times$ 150 $\mu m^2$ [31], which is much larger than the area of the MR itself, thus bringing a huge area cost. In addition, both EO-MR and PCM-MR based designs suffer from limited scalability and inference accuracy. To adjust the weight of an MR, its resonant frequency must be shifted by changing the state of the PCM or applying a different bias voltage, which increases the crosstalk between adjacent MRs and thus affects the output result. The way to eliminate this influence is to increase the wavelength separation between adjacent MRs, which reduces the number of multiplexable wavelengths and thus decreases scalability.

Different from the existing designs, we use two passive add-drop MRs and a GST cell to conduct the photonic dot-product, as shown in Fig. 1(a). The transmission spectrum of the PMR is shown in Fig. 1(b). In this simulation, the width and height of the waveguide are 450 nm and 220 nm, respectively. The radius of the PMR is 10 $\mu m$, and the gap between the straight waveguide and the ring is 0.2 $\mu m$. For an MR in the resonance state, the input signal is coupled into the ring and output entirely from the drop port. In our design, the two PMRs have the same resonance wavelength and always remain in the resonance state. Thus, when the input signal's wavelength matches the resonance wavelength of the MR, the optical signal is fully coupled to the output port without any bias voltage. The spacing of adjacent wavelength channels remains unchanged, avoiding the accuracy loss caused by applying different bias voltages to each channel.

Fig. 1. (a) GST-PMR-based photonic dot-product engine. (b) Transmission spectrum of each passive MR.

In the dot-product engine, the GST is used to store the kernel weights. GST is a photonic phase-change material that can store more than 5 bits per cell [32]. It can be embedded in a waveguide without significant area overhead, as shown in Fig. 2(a). Due to the evanescent coupling between the waveguide and the GST cell, the GST can quickly absorb energy from input signals. If the absorbed energy is high enough, the phase state of the GST changes. When GST is in the crystalline state (c-state), it absorbs strongly, so no signal passes through the waveguide (transmittance is 0). When GST is in the amorphous state (a-state), its absorption decreases and it does not affect the output power of the waveguide (transmittance is 1). When GST is in an intermediate state, its transmittance is between 0 and 1. A GST-loaded waveguide can thus be understood as an optical attenuator. Theoretically, the attenuation coefficient of the GST-loaded waveguide can be expressed as Eq. (1):

$$\alpha=\exp\left(-\frac{2\pi}{\lambda}\kappa_{eff,wg\_GST}\cdot L_{GST}\right)$$
where $\alpha$ is the attenuation coefficient of the GST-loaded waveguide, $\lambda$ is the wavelength of light, $\kappa _{eff,wg\_GST}$ is the imaginary part of the effective refractive index (obtained from the effective permittivity $\varepsilon _{eff}$), and $L_{GST}$ is the length of the GST. The $\varepsilon _{eff}$ can be estimated approximately by effective-medium theory [33]:
$$\frac{\varepsilon_{eff}\left(q\right)-1}{\varepsilon_{eff}\left(q\right)+2}=q\times\frac{\varepsilon_{c}-1}{\varepsilon_{c}+1}+\left(1-q\right)\times\frac{\varepsilon_{a}-1}{\varepsilon_{a}+1}$$
where $q$ is the crystallization degree of the GST, and $\varepsilon _{a}$ and $\varepsilon _{c}$ are the permittivities in the amorphous and crystalline states, respectively, calculated from the corresponding refractive indices of GST. Therefore, GST in different phase states has different effective refractive indices, resulting in different absorption rates of the waveguide. Figures 2(b)-(f) show the simulation results for the GST, with the relevant simulation settings annotated in Fig. 2. Specifically, Fig. 2(b) shows the transmittance of the optical waveguide for different GST lengths and phase states. Figures 2(c)-(e) show the E-field distribution on the cross-section of the waveguide without GST, with a-state GST, and with c-state GST, respectively, and Fig. 2(f) compares the E-field distribution and absorption rate of the waveguide under the different GST states and without GST.
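For intuition, the sketch below evaluates Eqs. (1)-(2) numerically in Python. The complex refractive indices of a- and c-state GST are illustrative literature-style values, not the calibrated material data used in the paper, and the extinction coefficient of the hybrid waveguide is approximated by that of the mixed GST itself rather than by a full mode solve.

```python
import numpy as np

# Illustrative (assumed) complex refractive indices of GST near 1550 nm;
# the paper's simulations use calibrated material data instead.
n_a = 3.9 + 0.05j          # amorphous state
n_c = 6.1 + 0.83j          # crystalline state
eps_a, eps_c = n_a**2, n_c**2

def eps_eff(q):
    """Effective permittivity for crystallization degree q, per Eq. (2)."""
    rhs = q * (eps_c - 1) / (eps_c + 1) + (1 - q) * (eps_a - 1) / (eps_a + 1)
    return (1 + 2 * rhs) / (1 - rhs)   # solve (eps - 1)/(eps + 2) = rhs for eps

def transmittance(q, L_gst, wavelength=1.55e-6):
    """Attenuation of the GST-loaded waveguide, per Eq. (1) (approximate)."""
    kappa = np.imag(np.sqrt(eps_eff(q)))             # extinction coefficient
    return np.exp(-2 * np.pi / wavelength * kappa * L_gst)

print(transmittance(q=0.0, L_gst=3e-6))   # a-state: high transmittance
print(transmittance(q=1.0, L_gst=3e-6))   # c-state: strong absorption
```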

Fig. 2. (a) GST-based on-chip memory. (b) Transmittance of the GST-based waveguide under different GST lengths and phase states. (c)-(f) E-field distribution and absorption rate of the optical waveguide under different GST states.

Each GST element supports two operations: writing weights and reading the stored information. The operating principle is shown in Fig. 1(a). During the write operation, a nanosecond pulse carrying sufficient energy is injected into the optical waveguide and coupled to the GST after passing through the MR; when the energy absorbed by the GST reaches its crystallization threshold, its phase state changes. The switching time between phase states is subnanosecond [34]. For example, the authors in [32] use a rectangular programming pulse of 50 ns to store 34 unique transmission levels in a single GST cell, with programming energies between 68 pJ and 135 pJ. The stored weight value is maintained for a long time, i.e., it is non-volatile. During the read operation, readout is performed with subnanosecond pulses (tens of picoseconds) of low energy (tens of fJ). This energy is far below the GST crystallization threshold and does not affect the state of the GST.

Because the ANN model can be pre-trained, our design is used only for inference, and the weights of the GSTs are configured offline. A weight can be read with a short, low-power optical pulse and used to perform multiplication. As shown in Fig. 1(a), assuming the transmittance of the GST is $b$, the input power is $a$, and the transmittance of the MRs is $m$, the output power is $c=a\times b\times m$. Since the weights do not need to be read from external memory and no MR modulation is required, our GST-PMR-based dot-product engine realizes optical in-memory multiplication at the speed of light. Ideally, the GST transmittance alone represents the weight value in the neural network; since $m$ is less than 1 but fixed, the laser input power can be increased by a factor of $1/m$ during the calculation to compensate for the power loss caused by the MRs.
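A toy numeric check of this multiplication, with illustrative values for $a$, $b$, and $m$ (the MR transmittance corresponds roughly to the ~0.1 dB drop-port insertion loss reported in Section 3.1):

```python
a = 1.0       # input optical power (normalized)
b = 0.63      # GST transmittance, i.e. the stored weight
m = 0.98      # fixed MR drop-port transmittance (~0.1 dB insertion loss)

c = a * b * m             # optical in-memory multiplication: c = a * b * m
a_comp = a / m            # pre-scale the laser power by 1/m ...
c_comp = a_comp * b * m   # ... so the output recovers the ideal product a * b
assert abs(c_comp - a * b) < 1e-12
```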

2.2 Photonic adder based on hybrid MDM-WDM

In order to increase the data-carrying capacity of a single wavelength, MDM technology has been proposed to offer an extra multiplexing dimension in the spatial domain. Moreover, because WDM and MDM target different domains, these two independent degrees of freedom can be used simultaneously to form a hybrid multiplexing technology [26–30]. Figure 3(a) shows the mode distribution in the optical waveguide. The number of modes that can be transmitted in a single waveguide increases with the waveguide width. For example, when the waveguide width is greater than 1.6 $\mu m$, the waveguide can support four modes (TE$_0$(M$_1$), TE$_1$(M$_2$), TE$_2$(M$_3$), TE$_3$(M$_4$)). Figure 3(b) shows the principle of the on-chip hybrid MDM-WDM transmission scheme. The x- and y-coordinates represent different wavelengths and modes. A signal at a single wavelength (mode) can carry multiple signals with different modes (wavelengths). Thus, if we combine MDM with WDM in a PANN accelerator, the number of channels available for parallel computing is multiplied.

Fig. 3. (a) Number of modes supported under different waveguide widths. (b) MDM-WDM transmission. (c) Mode converter. (d) Mode multiplexer.

Usually, a laser works only in the fundamental mode, so MDM requires mode converters. Figures 3(c) and 3(d) show the principle of the mode converter and mode multiplexer, respectively. Each converter consists of two side-by-side waveguides with different widths, because the number of modes supported in a waveguide increases with its width. Suppose Waveguides 1, 2, and 3 support mode 1 (M$_1$); modes 1 and 2 (M$_1$ and M$_2$); and modes 1, 2, and 3 (M$_1$, M$_2$, and M$_3$), respectively. When the M$_1$ signal in Waveguide 1 passes through the coupling region, it can excite the M$_2$ (M$_3$) signal in Waveguide 2 (3) once the phase-matching condition is met, so that the mode is converted. A mode multiplexer is then formed by connecting different mode converters through tapers, as shown in Fig. 3(d). In the hybrid MDM-WDM computing system, assuming the input power of the signal at the $i$-th wavelength and $j$-th mode is $P_{W_i,M_j}$, where $i\in \{1,2,\ldots,W\}$ and $j\in \{1,2,\ldots,P\}$, the following addition operation can be performed in one cycle:

$$P_{output}=\sum_{i=1}^W{\sum_{j=1}^P{P_{W_i,M_j}}}$$
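As a minimal functional sketch of Eq. (3), the power landing on a shared photodetector is simply the sum over all $W\times P$ (wavelength, mode) channels; the channel powers below are illustrative random values.

```python
import numpy as np

W, P = 4, 4                      # wavelengths and modes, as in the 4x4x4 design
rng = np.random.default_rng(0)
P_wm = rng.random((W, P))        # P_{W_i, M_j}: power carried per channel

P_output = P_wm.sum()            # Eq. (3): one-cycle addition over W*P channels
print(P_output)
```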

2.3 StarLight architecture

Figure 4 shows the proposed PANN accelerator architecture, StarLight. It consists of multiple tiles connected by an optical network-on-chip (ONoC) (Fig. 4(a)). Each tile communicates with the others through its router, and each tile contains several photonic processing units (PPUs) (Fig. 4(b)). Each PPU includes one PVMM, which utilizes hybrid MDM-WDM technology to boost CNN inference throughput with minimal structural complexity. In addition, we integrate GST-PMR-based photonic dot-product engines into the PVMM for ‘in-memory’ computing with high frequency and low energy consumption. Figure 4(c) shows the PPU based on hybrid MDM-WDM and the photonic dot-product engines. Its function is to perform the MAC operation between matrix $\boldsymbol{A}$ (kernels) and vector $\boldsymbol{B}$ (input). A PPU featuring $P$ modes has an array of fundamental-mode (M$_1$) continuous-wave (CW) lasers at different wavelengths, a $1\times P$ splitter and $P$ $1\times M$ splitters, and a multilayer PVMM that stores the kernel matrix $\boldsymbol{A}$ and weights the input signals. Elements of vector $\boldsymbol{B}$ are represented by the PVMM input power values produced by the $N$-element CW laser array and modulated by $N$ MRs. The different wavelength signals in M$_1$ are multiplexed into one optical waveguide through a multiplexer and then divided into $P$ parts through the $1\times P$ splitter. The split signals are transmitted to different planes to conduct parallel computation and generate output through the port of a mode multiplexer. The PVMM has $P$ layers, and each layer is a parallel array consisting of a series of GST-PMR-based dot-product engines. Once the transmittances of the GSTs are set, the values remain fixed without any external control energy. Optical signals with the same power but different wavelengths are applied to all the rows by a $1\times M$ splitter, ensuring that each neuron has the same input power. The MRs in each row have different resonance wavelengths to avoid wavelength conflicts and achieve an ultra-high degree of parallelism. For MDM, the optical signal with power $b_i$ $(i=1,2,\ldots,M)$ is injected into the input port of each row in all mode layers, as shown in Fig. 4(d). Assume the weight of the GST in the $i$-th row, $j$-th column, and $p$-th layer of unit $\boldsymbol{A}$ is set to $a_{p_{(i,j)}}$. Each input WDM signal ($\boldsymbol{B}$) is mapped to a different row of the GST-PMR-based parallel array by the splitter. After passing through the dot-product units with weights $a_{p_{(1,i)}}$, the optical signals are coupled to the first row of the matrix, and the output powers are $b_i\cdot a_{p_{(1,i)}}$, respectively. Since all dot-product units in the first row of different mode layers operate at different wavelengths and modes, the MAC result is computed by superimposing all units in the first row, as modeled in the sketch below.
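A compact functional model of this computation, under the assumption (consistent with Eq. (5) below) that every mode layer receives the same WDM input vector and that the layer outputs superimpose row by row:

```python
import numpy as np

P, N = 4, 4                          # mode layers and wavelength channels
rng = np.random.default_rng(1)
A = rng.random((P, N, N))            # a_{p,(i,j)}: GST transmittances per layer
b = rng.random(N)                    # input powers, one per wavelength

# Row i of layer p accumulates sum_j a_{p,(i,j)} * b_j via WDM on its bus;
# mode multiplexing then superimposes the P layer outputs row by row.
per_layer = A @ b                    # shape (P, N): WDM summation in each layer
y = per_layer.sum(axis=0)            # MDM summation across the mode layers
print(y)
```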

Fig. 4. StarLight top-level architecture. (a) The interconnect structure between different tiles. (b) The interconnection between PPUs within each tile. (c) The internal structure of each PPU unit. (d) Schematic diagram of hybrid MDM-WDM in each PVMM, taking the first row in each layer as an example.

By changing the crystallization degree of the GST, weights in the range [0, 1] can be represented, but negative weights exist in practical applications. Therefore, we adopt the separation weight generation (SWG) algorithm to represent weights in the range [-1, 1]. The principle of the SWG algorithm is shown in Algorithm 1. Figure 5(a) shows the weighting process using the PVMM structure combined with the SWG algorithm. As shown in Fig. 5(a), the optical signal is first modulated and divided into two paths through a 50:50 splitter, and then enters two PVMM structures for weighting. In this way, the GST weights in each PVMM structure no longer correspond to the entries of the original complete weight matrix but to the weights produced by the SWG algorithm.

Algorithm 1. Separation Weight Generation (SWG)

Fig. 5. (a) Weighting process using SWG algorithm. (b) Operation result waveform.

The weight matrix $\boldsymbol{W}$, which contains negative weight values, is separated into two sub-matrices $\boldsymbol{Wp}$ and $\boldsymbol{Wn}$: $\boldsymbol{Wp}$ contains all non-negative elements of $\boldsymbol{W}$, with the remaining elements replaced by 0, while $\boldsymbol{Wn}$ contains the absolute values of all negative elements of $\boldsymbol{W}$, with the remaining elements replaced by 0. The weight values in the two sub-matrices are then mapped to the GSTs in the upper and lower PVMMs, which modulate the two optical signals by the positive and negative weight values, respectively, and the two weighted outputs are computed separately. Finally, photodetectors convert the two weighted optical signals into electrical signals, and subtraction is performed in the electrical domain to obtain the final result. The process can be expressed as:

$$Y=W\cdot X=Wp\cdot X-Wn\cdot X$$
Figure 5(b) shows the final output waveform of the PVMM-weighted operation combined with this algorithm.
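A minimal sketch of the SWG separation and the electrical-domain subtraction of Eq. (4); the matrix values are illustrative, and the two optical paths are idealized as exact matrix-vector products.

```python
import numpy as np

def swg_split(W):
    """Split a signed weight matrix into non-negative Wp and Wn (Algorithm 1)."""
    Wp = np.maximum(W, 0.0)    # non-negative entries of W, zeros elsewhere
    Wn = np.maximum(-W, 0.0)   # |negative entries| of W, zeros elsewhere
    return Wp, Wn

W = np.array([[0.5, -0.3], [-1.0, 0.8]])   # illustrative signed weights
X = np.array([0.6, 0.4])                   # illustrative input vector
Wp, Wn = swg_split(W)
Y = Wp @ X - Wn @ X                        # Eq. (4): subtract electrically
assert np.allclose(Y, W @ X)
```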

3. Simulation results and discussions

In this section, to verify the theoretical correctness of our design, we use ANSYS Lumerical Solutions [35] to conduct device- and system-level modeling and hardware parameter optimization for the $4\times 4\times 4$ PVMM, providing guidance for layout design. Then, to verify the availability of the PVMM module, we perform a multi-class classification task on the Iris dataset using MATLAB and the Lumerical platform. Finally, we perform a scalability analysis of StarLight to evaluate the maximum computing capability the proposed architecture can provide.

3.1 Device parameters optimization and performance analysis

We design the $4\times 4\times 4$ PVMM using four wavelengths and four modes. For four-mode multiplexing, we optimize the widths of waveguides 1 to 4, the coupling lengths, and the coupling gaps of the different mode converters to ensure phase matching among the different modes in the waveguides. After optimization, the widths of waveguides 1 to 4 are 0.45 $\mu m$, 1.0 $\mu m$, 1.58 $\mu m$, and 2.17 $\mu m$, respectively. The gaps between waveguide 1 and waveguides 2/3/4 are all 0.2 $\mu m$, while the coupling lengths between waveguide 1 and waveguides 2/3/4 are 15 $\mu m$, 16.5 $\mu m$, and 17.7 $\mu m$, respectively. Figure 6 shows the simulated E-field distributions of the different mode converters.

Fig. 6. E-field distribution of the mode converters. (a) TE$_0$ to TE$_1$, CE=99$\%$, IL=0.05 dB. (b) TE$_0$ to TE$_2$, CE=90$\%$, IL=0.45 dB. (c) TE$_0$ to TE$_3$, CE=73$\%$, IL=1.36 dB.

In order to realize four-wavelength multiplexing, four MRs with different resonance wavelengths need to be designed. In this design, the radii of the four MRs are close to 10 $\mu m$, giving a free spectral range (FSR) of 10 nm. A small difference in the radii separates their resonance wavelengths. We also optimize the coupling gap of each of the four MRs to keep it close to the critical coupling state so that the extinction ratio (ER) is maximized. Figure 7(a) shows the transmission spectra of the through port of the upper MR and the drop port of the lower MR in the dot-product engine without GST. For MRs in the resonance state, the insertion loss of the drop port is less than 0.1 dB, and the crosstalk to the through port is less than -50 dB. For MRs in the non-resonant state, the insertion loss of the through port is less than 0.05 dB, and the crosstalk to the drop port is less than -30 dB, without considering fabrication error. Figures 7(b) and 7(c) show the E-field distribution of the GST-PMR based dot-product engine with the input power set to 1. When there is no GST in the dot-product engine, almost all the light is transmitted from the output port, as shown in Fig. 7(b). By adjusting the length and crystallization degree of the GST, the power at the output port can be changed. As seen in Fig. 7(c), when the length of the GST is 3 $\mu m$ and the crystallization degree is 0.47, the output power is 0.63.
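As a quick consistency check of this design point, the FSR of a ring resonator is approximately $\lambda^2/(n_g \cdot 2\pi R)$; the group index below is an assumed value typical of silicon strip waveguides, not a number quoted in the paper.

```python
import numpy as np

lam = 1.55e-6    # operating wavelength (m)
R = 10e-6        # ring radius (m), as in the design
n_g = 3.8        # assumed group index of the silicon waveguide

fsr = lam**2 / (n_g * 2 * np.pi * R)
print(f"FSR ~ {fsr * 1e9:.1f} nm")   # ~10 nm, consistent with the stated FSR
```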

Fig. 7. (a) Transmission spectra of four different MRs. (b)-(c) E-field distribution of the dot-product engine: (b) without GST; (c) with a GST length of 3 $\mu m$ and a crystallization degree of 0.47.

3.2 Simulation verification of MDM-WDM PVMM

We use four 4$\times$4 PVMMs to simulate the process of weighting and adding four matrices simultaneously, with the results obtained using the optical simulation software Lumerical INTERCONNECT. We obtained output waveforms at input bit rates of 10 Gbit/s, 15 Gbit/s, and 20 Gbit/s, as shown in Fig. 8. The input vector $B=[b_1;b_2;b_3;b_4]$ is modulated by four $2^7-1$ pseudo-random binary sequences (PRBS) from pattern generators. The values in the kernel matrix $\boldsymbol{A}$ are randomly generated and loaded into the corresponding GST cells. With $C=[c_1;c_2;c_3;c_4]$ as the final result, the relationship between them is given by Eq. (5).

$$\begin{aligned} &{\left[\begin{array}{l} c_1 \\ c_2 \\ c_3 \\ c_4 \end{array}\right]=\underbrace{\left[\begin{array}{cccc} 1 & 0 & 0.5 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0.5 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{array}\right]}_{A_1} \times\left[\begin{array}{l} b_1 \\ b_2 \\ b_3 \\ b_4 \end{array}\right]+\underbrace{\left[\begin{array}{cccc} 0.5 & 1 & 0 & 1 \\ 0.5 & 1 & 1 & 0.5 \\ 1 & 1 & 0.5 & 1 \\ 1 & 0 & 0 & 1 \end{array}\right]}_{A_2} \times\left[\begin{array}{l} b_1 \\ b_2 \\ b_3 \\ b_4 \end{array}\right]} \\ &+\underbrace{\left[\begin{array}{cccc} 1 & 0 & 1 & 0.5 \\ 1 & 0.5 & 1 & 1 \\ 1 & 0.5 & 0 & 1 \\ 0.5 & 1 & 0 & 1 \end{array}\right]}_{A_3} \times\left[\begin{array}{l} b_1 \\ b_2 \\ b_3 \\ b_4 \end{array}\right]+\underbrace{\left[\begin{array}{cccc} 0 & 0.5 & 0.5 & 1 \\ 0.5 & 1 & 1 & 0 \\ 0.5 & 0 & 1 & 1 \\ 1 & 0.5 & 1 & 0 \end{array}\right]}_{A_4} \times\left[\begin{array}{c} b_1 \\ b_2 \\ b_3 \\ b_4 \end{array}\right] \end{aligned} $$
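For reference, a functional check of Eq. (5) in Python: the matrices $A_1$-$A_4$ are copied from the equation, while the input vector is illustrative (in the simulation it is a PRBS pattern).

```python
import numpy as np

A = np.array([
    [[1, 0, 0.5, 1], [0, 1, 1, 1], [1, 0.5, 0, 1], [0, 1, 1, 0]],        # A1
    [[0.5, 1, 0, 1], [0.5, 1, 1, 0.5], [1, 1, 0.5, 1], [1, 0, 0, 1]],    # A2
    [[1, 0, 1, 0.5], [1, 0.5, 1, 1], [1, 0.5, 0, 1], [0.5, 1, 0, 1]],    # A3
    [[0, 0.5, 0.5, 1], [0.5, 1, 1, 0], [0.5, 0, 1, 1], [1, 0.5, 1, 0]],  # A4
])
b = np.array([1.0, 0.0, 1.0, 1.0])   # one PRBS symbol per wavelength channel

c = (A @ b).sum(axis=0)              # Eq. (5): c = A1@b + A2@b + A3@b + A4@b
print(c)
```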

Fig. 8. Simulation waveforms for 4$\times$4$\times$4 PVMM with 10 Gbps, 15 Gbps and 20 Gbps.

It can be seen that the input bit rate has a great impact on the results. We calculated the error rate of the sample points in each period, which is about 1.12%, 1.11%, and 7.31% at input bit rates of 10 Gbps, 15 Gbps, and 20 Gbps, respectively. At 15 Gbps, our PVMM still exhibits only a small error. This is because, in our PVMM, the weight change is achieved by controlling the transmittance of the GST while the wavelength channel spacing between different MRs is fixed, so the crosstalk is negligible. The calculation results are therefore affected only by the transmission loss of the optical signal in the waveguide, and a higher accuracy can be obtained compared with the designs mentioned in the Introduction.

In this simulation, we utilize multiplexing of four modes and four wavelengths. Each layer uses the TE$_0$ signal to perform a 4$\times$4 matrix multiplication. The output signals of the 2nd, 3rd, and 4th layers are then converted into the TE$_1$, TE$_2$, and TE$_3$ modes through the mode converters and finally multiplexed. Since the loss of each mode conversion is fixed, the input power of the different layers can be pre-compensated to ensure calculation accuracy. If the input signal power of the first layer is 1, the input power of the fourth layer can be set to 1.37, since the conversion efficiency of the TE$_0$-to-TE$_3$ mode converter is 0.73. In addition, the required compensation power can be reduced by further optimizing the mode converters.
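In code, the pre-compensation is just the reciprocal of each conversion efficiency from Fig. 6 (TE$_0$ needs no conversion and is included for completeness):

```python
# Conversion efficiencies from Fig. 6; the laser feeding the layer that ends
# in mode TE_k is driven 1/CE_k harder to equalize the layer outputs.
ce = {"TE0": 1.00, "TE1": 0.99, "TE2": 0.90, "TE3": 0.73}
comp = {mode: 1.0 / eff for mode, eff in ce.items()}
print(comp)   # TE3 -> ~1.37, matching the factor quoted above
```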

3.3 Area, power consumption, and computing capability evaluation

To evaluate the area overhead of StarLight, an accurate understanding and description of every optical component is needed. The 4$\times$4$\times$4 PVMM-based StarLight contains 4 EO-MRs for modulation, 64 GST-PMR dot-product engines, five 1$\times$4 splitters, and four 4-mode multiplexers. Since the radius of the MR used in our simulation is around 10 $\mu m$, we estimate the maximum chip area by assuming each EO-MR occupies 20$\times$20 $\mu m^2$, with overhead circuitry increasing the footprint to 25$\times$25 $\mu m^2$. Each EO-MR also needs two electrical pads, and the footprint of each electrical pad is 150$\times$150 $\mu m^2$ [31]. In total, the footprint of each GST-PMR dot-product engine is 50$\times$45 $\mu m^2$. The distance between two neighboring engines is assumed to be 10 $\mu m$ to avoid interference between engines. The area of a 1$\times$2 splitter is 32.5 $\mu m\times$6 $\mu m$ [36]. According to the simulation results in Fig. 6, the area of each 4-mode multiplexer is less than 100 $\times$ 10 $\mu m^2$. Thus, the chip area of the 4$\times$4$\times$4 PVMM is less than 0.4 $mm^2$, it is capable of operating at 0.96$\times$10$^{12}$ MAC/s, and the computing density exceeds 2.4 TMAC/s/mm$^2$. Another highlight of the proposed PVMM is its energy consumption advantage. In the inference process, since the weights are stored in GST beforehand, the kernel matrix does not need to be updated; thus, the kernel weight matrix consumes zero power. The energy consumed over the whole PVMM inference process comes entirely from the EO-MRs. For a simple comparison, according to [20], the power consumption of each EO-MR is 19.5 mW. Our 4$\times$4$\times$4 PVMM needs only four EO-MRs, so its power consumption is 0.078 W. A traditional MR-based PVMM such as DEAP would require 72 EO-MRs to perform 4$\times$4$\times$4-scale MAC operations (8 for the input vector and 64 for the kernel matrix), for a total power consumption of 1.404 W. This shows that the proposed PVMM has excellent power consumption performance.
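The arithmetic behind these figures, reproduced as a short script (device counts and the per-EO-MR power are taken from the text and [20]; the 0.4 mm$^2$ area is the upper bound derived above):

```python
p_mr = 19.5e-3                        # W per EO-MR modulator, from [20]

p_starlight = 4 * p_mr                # StarLight: 4 input EO-MRs only
p_deap = (8 + 64) * p_mr              # DEAP-style: 8 input + 64 weight EO-MRs
print(p_starlight, p_deap)            # 0.078 W vs 1.404 W

macs_per_s = 64 * 15e9                # 64 MACs per symbol at 15 GHz
density = macs_per_s / 0.4            # divide by the <0.4 mm^2 area bound
print(density / 1e12, "TMAC/s/mm^2")  # > 2.4
```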

3.4 Classification task with an ANN model

To verify the availability of the StarLight module, we perform a multi-class classification task on the Iris dataset using MATLAB and the Lumerical platform. The Iris dataset contains a total of 150 groups of data. Each group includes four feature values of an iris: sepal length, sepal width, petal length, and petal width. The irises are divided into three categories (Setosa, Versicolor, and Virginica). The ANN model is shown in Fig. 9(a); it includes one input layer, one hidden layer, and one output layer.

Fig. 9. Classification task of Iris Dataset.

Figure 9(c) illustrates the basic classification process. We divided the Iris dataset into two parts: 100 samples are used as the training set to compute the ideal network weights, and the remaining 50 samples form the test set for network evaluation. First, we train the model using the PyTorch framework on a 64-bit workstation. Then, we derive the GST crystallization degree corresponding to each weight in the weight matrix and load the GST data into the PVMM module. As shown in Fig. 9(b), we use the simulation software to run inference on the test set. We map the four input feature values to optical signals at different wavelengths and realize the weighting process in Lumerical, and then extract the weighting results into MATLAB through the API between Lumerical and MATLAB for nonlinear activation to obtain the classification results. The accuracy of Iris dataset classification using this method is 96%, completely consistent with the accuracy on a 64-bit computer. Figure 9(d) shows the confusion matrix of the Iris dataset classification results.
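A hedged sketch of the offline training step described above; the hidden-layer width, activation, and optimizer are assumptions (the paper specifies only a 4-input, one-hidden-layer, 3-output topology), and `x_train`/`y_train` stand in for the 100-sample training split.

```python
import torch
import torch.nn as nn

# Assumed 4-(8)-3 topology; only the 4-input / 3-output shape and the single
# hidden layer are given in the paper.
model = nn.Sequential(nn.Linear(4, 8), nn.Sigmoid(), nn.Linear(8, 3))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

def train_epoch(x_train, y_train):
    """One gradient step on the Iris training split (x: [100,4], y: [100])."""
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()
    return loss.item()

# After training, each linear-layer weight is mapped to a GST crystallization
# degree (via the SWG split for signed weights) and loaded into the PVMM; the
# matrix products then run optically and the activation is applied
# electronically, as in Fig. 9(b).
```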

4. Conclusion

This paper proposed a photonic ANN accelerator, StarLight, to maximize ANN inference performance. It achieves high inference throughput by exploiting the hybrid MDM-WDM method, an effective way to increase throughput without increasing the number of lasers. We also integrated GST and passive MRs in StarLight to achieve in-situ photonic processing with near-zero power consumption. Unlike traditional PCM-MR-based PANN accelerators, StarLight does not need to integrate the PCM into the MRs, avoiding interference between MRs under different weight configurations and thereby improving inference accuracy and scalability. We simulated Iris dataset classification using a 4$\times$4$\times$4 PVMM and achieved an inference accuracy of 96%. With a 15 Gbps modulation speed, the computing density exceeds 2.4 TMAC/s/mm$^2$. In future work, we will consider further optimizations, such as increasing the number of multiplexed modes, using optical frequency comb technology to increase the number of multiplexed wavelengths, and increasing the modulation speed, to further improve the computing density. StarLight thus holds promise for realizing high-performance neural network hardware accelerators to address the incoming challenges of data-intensive AI applications such as intelligent healthcare and autonomous driving.

Funding

National Key Research and Development Program of China (2018YFE026800); National Natural Science Foundation of China (62071076, 62075024, 62205043, 62222103); Chongqing Postdoctoral Science Foundation (2010010006251081); Chongqing Top-notch Youth Talent Support Project (CQYC201905075); Natural Science Foundation of Chongqing (cstc2019jcyj-msxmX0615); Chongqing Municipal Education Commission (CXQT21019).

Acknowledgments

This work was supported in part by the National Key R&D Program of China, and in part by the National Natural Science Foundation of China, the Chongqing Postdoctoral Science Foundation, the Chongqing Top-notch Youth Talent Support Project, and the Natural Science Foundation of Chongqing.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits 52(1), 127–138 (2017). [CrossRef]  

2. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM 60(6), 84–90 (2017). [CrossRef]

3. Google. (2020). Google Assistant, Your own personal Google. [Online]. Available: https://assistant.google.com.

4. L. R. Juracy, M. T. Moreira, A. de M. Amory, A. F. Hampel, and F. G. Moraes, “A high-level modeling framework for estimating hardware metrics of CNN accelerators,” IEEE Trans. Circuits Syst. I 68(11), 4783–4795 (2021). [CrossRef]  

5. A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew, “Deep learning with COTS HPC systems,” in Int. conf. on Machine learning (2013), pp. 1337–1345.

6. C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “CNP: An FPGA-based processor for convolutional networks,” in Int. Conf. Field Program. Log. Appl. (2009), pp. 32–37.

7. A. Graves, G. Wayne, M. Reynolds, et al., “Hybrid computing using a neural network with dynamic external memory,” Nature 538(7626), 471–476 (2016). [CrossRef]  

8. Y. Chen, L. Liu, S. Zhang, and O. Temam, “Dadiannao: A machine-learning supercomputer,” in 47th Annu. IEEE/ACM Int. Symp. Microarchit. (2014), pp. 609–622.

9. A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (2016), pp. 14–26.

10. P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. Yang, and H. Qian, “Fully hardware-implemented memristor convolutional neural network,” Nature 577(7792), 641–646 (2020). [CrossRef]  

11. C. Li, D. Belkin, Y. Li, Y. Peng, H. Miao, N. Ge, J. Hao, E. Mont, L. Peng, and Z. Wang, “Efficient and self-adaptive in-situ learning in multilayer memristor neural networks,” Nat. Commun. 9(1), 1–8 (2018). [CrossRef]  

12. Y. Shen, N. C. Harris, S. Skirlo, D. Englund, and M. Soljacic, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017). [CrossRef]  

13. M. Y. S. Fang, S. Manipatruni, C. Wierzynski, A. Khosrowshahi, and M. R. DeWeese, “Design of optical neural networks with component imprecisions,” Opt. Express 27(10), 14009–14029 (2019). [CrossRef]

14. F. Shokraneh, S. G. Gagnon, and O. L. Ladouceur, “The diamond mesh, a phase-error-and loss-tolerant field-programmable MZI-based optical processor for optical neural networks,” Opt. Express 28(16), 23495–23508 (2020). [CrossRef]  

15. H. Zhang, M. Gu, X. D. Jiang, J. Thompson, H. Cai, S. Paesani, R. Santagati, A. Laing, Y. Zhang, and M. H. Yung, “An optical neural chip for implementing complex-valued neural network,” Nat. Commun. 12(1), 1–11 (2021). [CrossRef]  

16. A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Broadcast and weight: An integrated network for scalable photonic spike processing,” J. Lightwave Technol. 32(21), 4029–4041 (2014). [CrossRef]  

17. A. N. Tait, T. F. De Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep. 7(1), 7430 (2017). [CrossRef]  

18. D. Dang, J. Dass, and R. Mahapatra, “ConvLight: A convolutional accelerator with memristor integrated photonic computing,” in IEEE 24th Int. Conf. High Perform. Comput. (2017), pp. 114–123.

19. A. Mehrabian, Y. Al-Kabani, V. J. Sorger, and T. El-Ghazawi, “PCNNA: A photonic convolutional neural network accelerator,” in 31st IEEE Int. Syst-on-Chip Conf. (2018), pp. 169–173.

20. V. Bangari, B. A. Marquez, H. Miller, A. N. Tait, M. A. Nahmias, T. F. de Lima, H. Peng, P. R. Prucnal, and B. J. Shastri, “Digital electronics and analog photonics for convolutional neural networks (DEAP-CNNs),” IEEE J. Sel. Top. Quantum Electron. 26(1), 1–13 (2020). [CrossRef]

21. W. Liu, W. Liu, Y. Ye, Q. Lou, Y. Xie, and L. Jiang, “HolyLight: A nanophotonic accelerator for deep learning in data centers,” in Des. Automat. & Test in Europe Conf. & Exhibition (DATE) (2019), pp. 1483–1488.

22. P. Guo, W. Hou, L. Guo, W. Sun, C. Liu, H. Bao, L. H. K. Duong, and W. Liu, “Fault-tolerant routing mechanism in 3d optical network-on-chip based on node reuse,” IEEE Trans. Parallel Distrib. Syst. 31(3), 547–564 (2020). [CrossRef]  

23. P. Guo, W. Hou, L. Guo, Z. Cao, and Z. Ning, “Potential threats and possible countermeasures for photonic network-on-chip,” IEEE Commun. Mag. 58(9), 48–53 (2020). [CrossRef]  

24. J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. L. Gallo, X. Fu, A. Lukashchuk, A. Raja, and J. Liu, “Parallel convolutional processing using an integrated photonic tensor core,” Nature 589(7840), 52–58 (2021). [CrossRef]  

25. I. Chakraborty, G. Saha, and K. Roy, “Photonic in-memory computing primitive for spiking neural networks using phase-change materials,” Phys. Rev. Appl. 11(1), 014063 (2019). [CrossRef]  

26. X. Wu, C. Huang, K. Xu, C. Shu, and H. K. Tsang, “Mode-division multiplexing for silicon photonic network-on-chip,” J. Lightwave Technol. 35(15), 3223–3228 (2017). [CrossRef]  

27. P. Guo, W. Hou, L. Guo, Z. Ning, M. S. Obaidat, and W. Liu, “WDM-MDM silicon-based optical switching for data center networks,” in Proc. IEEE Int. Conf. Commun. (2019), pp. 1–6.

28. L. W. Luo, N. Ophir, C. P. Chen, L. Gabrielli, C. B. Poitras, K. Bergman, and M. Lipson, “WDM-compatible mode-division multiplexing on a silicon chip,” Nat. Commun. 5(1), 3069 (2014). [CrossRef]

29. W. Hou, P. Guo, L. Guo, X. Zhang, H. Chen, and W. Liu, “O-Star: An optical switching architecture featuring mode and wavelength-division multiplexing for on-chip many-core systems,” J. Lightwave Technol. 40(1), 24–36 (2022). [CrossRef]  

30. D. Dai, C. Li, S. Wang, H. Wu, Y. Shi, Z. Wu, S. Gao, T. Dai, H. Yu, and H. K. Tsang, “10-Channel Mode (de)multiplexer with dual polarizations,” Laser Photonics Rev. 12(1), 1700109 (2018). [CrossRef]  

31. D. Nikolova, D. M. Calhoun, Y. Liu, S. Rumley, A. Novack, T. Baehr-Jones, M. Hochberg, and K. Bergman, “Modular architecture for fully non-blocking silicon photonic switch fabric,” Microsyst. Nanoeng. 3(1), 16071 (2017). [CrossRef]  

32. X. Li, N. Youngblood, C. Ríos, Z. Cheng, and H. Bhaskaran, “Fast and reliable storage using a 5-bit, nonvolatile photonic memory cell,” Optica 6(1), 1–6 (2019). [CrossRef]

33. N. V. Voshchinnikov, G. Videen, and T. Henning, “Effective medium theories for irregular fluffy structures: aggregation of small particles,” Appl. Opt. 46(19), 4065–4072 (2007). [CrossRef]  

34. C. Ríos, M. Stegmaier, P. Hosseini, D. Wang, T. Scherer, C. D. Wright, H. Bhaskaran, and W. H. P. Pernice, “Integrated all-photonic non-volatile multi-level memory,” Nat. Photonics 9(11), 725–732 (2015). [CrossRef]

35. ANSYS Lumerical Solutions. [Online]. Available: https://www.lumerical.com/cn/.

36. J. Ding, F. Zhang, W. Zhu, P. Zhou, Q. Chen, L. Zhang, and L. Yang, “Optical digital to analog converter based on microring switches,” IEEE Photonics Technol. Lett. 26(20), 2066–2069 (2014). [CrossRef]  

