Two-layer integrated photonic architectures with multiport photodetectors for high-fidelity and energy-efficient matrix multiplications

Rui Tang; Makoto Okano; Kasidit Toprasertpong; Shinichi Takagi; Dirk Englund; Mitsuru Takenaka

doi:10.1364/OE.457258

1. Introduction

Deep learning is revolutionizing a wide range of scientific fields and industries. Matrix multiplications are indispensable to deep learning but computationally heavy for general-purpose central processing units (CPUs), so graphics processing units (GPUs) or application-specific integrated circuits (ASICs) are usually used to accelerate the computation [1]. However, further improving the performance of electronic processors is becoming progressively more difficult as Moore’s law slows down. Recently, photonic integrated circuits (PICs) have emerged as a promising tool for accelerating matrix multiplications, more specifically matrix-vector multiplications (MVMs), in deep learning [2–9]. Photons have notably lower energy loss during transmission and higher bandwidth than electrons. An optical MVM accelerator is therefore expected to have a higher processing speed and energy efficiency than its electronic counterpart [7].

In the seminal work of Y. Shen et al. [2], a coherent linear optical processor was first used to accelerate the MVM in deep learning, which then inspired numerous works that seek to demonstrate large-scale chips or further improve the device performance [10–22]. In this scheme, a weight matrix in the neural network is first decomposed into the product of two unitary matrices and a diagonal matrix by singular value decomposition (SVD) [23,24]. The two unitary matrices are then implemented by two universal unitary multiport interferometers (UUMIs) composed of a mesh of tunable Mach-Zehnder interferometers (MZIs) [25–27]. However, the SVD itself is time-consuming, and consequently the acceleration of MVM is only feasible for pre-decomposed matrices. This has two consequences: first, the application of this scheme is largely restricted to the inference of static neural networks; second, it requires a huge amount of external memory to store all the decomposed results, especially for deep neural networks with billions of parameters. Another coherent scheme using a photonic crossbar for MVM was also demonstrated recently [28,29]. In addition to the coherent scheme, noncoherent optical MVM accelerators based on wavelength-division multiplexing (WDM) have also been proposed and demonstrated [30–37]. These devices use tunable microring resonators (MRRs) to modulate and multiplex optical signals at different wavelengths and then noncoherently add them up with a photodetector (PD) array. However, simultaneously controlling hundreds of MRRs is technically challenging due to the high sensitivity of MRRs. As a result, MRR-based devices that support a matrix dimension larger than 10 × 10 have not yet been demonstrated.

Although the schemes mentioned above focus on the acceleration of MVM, it is important to note that in many cases, the general matrix-matrix multiplication (GEMM) dominates the computations in deep learning because of batch processing [4,7,38,39]. In this context, multiple input vectors are grouped together into a matrix and multiplied by a weight matrix. To perform this multiplication on an MVM device, the device has to work either sequentially or in parallel: sequential MVMs process one vector in one cycle and therefore require multiple cycles; parallel MVMs process each vector at a different device and therefore require multiple devices working simultaneously. Compared with these, a GEMM device that can process multiple vectors simultaneously in one cycle, without adding modulators for the weight matrix, is highly desirable: on the one hand, it has a significantly higher throughput than a sequential MVM device; on the other hand, it requires fewer optical modulators than parallel MVM devices. Optical GEMM accelerators have been investigated for decades using free-space devices [40–43]. In integrated photonics, the GEMM has been demonstrated first using off-chip wavelength multiplexers and recently using on-chip wavelength multiplexers [6,44].

In this paper, we first propose a novel integrated photonic architecture for MVM, based on the state-of-the-art two-layer waveguide platform. Device configuration in this scheme is straightforward, and therefore complicated matrix decomposition is no longer required. More importantly, in contrast to all previous architectures, this architecture features an intrinsically small hardware error that does not increase with the device scale, which is crucial for high-fidelity matrix multiplications. Moreover, by adding the wavelength degree of freedom, we further develop this concept and propose an integrated photonic architecture for GEMM, which incorporates on-chip wavelength multiplexers without creating waveguide crossings. This architecture allows the GEMM to be directly performed on a photonic chip with a high energy efficiency unattainable by parallel or sequential MVMs.

2. Architectures

Figures 1(a),(b) illustrate two slightly different versions of our proposed integrated photonic architecture for MVM. In Fig. 1(a), continuous-wave (CW) light at a single wavelength is sent into the input port and equally split by a cascaded stage of optical splitters. Light intensities inside the waveguides are then modulated by a modulator array to encode a column vector. These modulated optical signals are further equally split and guided to the matrix encoding region via the waveguides in the second layer. Here, the 1 × 2 splitters depicted in Fig. 1 are for illustration purposes only. We can also use, for example, highly uniform 1 × 3 splitters or a combination of various splitters to support numbers of outputs other than powers of two [45]. While not all integer numbers can be supported in this way, this may not be a serious problem in practical applications. The second layer avoids the use of waveguide crossings [46], which can cause undesired crosstalk and unbalanced path losses that contribute to computation errors; all paths can thus have the same loss simply by equalizing the path length. Without the second layer, the same structure implemented on a single-layer platform will result in unbalanced numbers of waveguide crossings among the paths, ranging from 0 to (N-1)² for an N×N device. For large-scale devices, the resulting differences in insertion loss among the paths are not negligible. The multi-layer waveguide platform is a promising direction for photonic integration since it offers excellent flexibility in waveguide routing and allows for higher integration densities than the single-layer platform [47,48]. Two-layer waveguide platforms based on Si and SiN are already available in several silicon photonics foundries [49–51], where adiabatic interlayer couplers with a loss of 0.1 dB are available [49]. In our architecture, the distance between the two layers needs to be sufficiently large to avoid loss and crosstalk and therefore may require an intermediate layer to facilitate the coupling in the adiabatic interlayer couplers, as demonstrated in [47].

Fig. 1. Proposed integrated photonic architectures for matrix-vector multiplication (MVM) and general matrix-matrix multiplication (GEMM). (a) One version of the architecture for MVM. The input modulator array encodes a column vector into the intensity of optical signals, which are then equally split by a cascaded stage of splitters and guided to the matrix encoding region via the waveguides in the second layer (represented by blue lines). In the matrix encoding region, each modulator array along the y direction encodes a row vector of the matrix and performs element-wise multiplications between the input column vector and the matrix row. Finally, these twice-modulated optical signals are automatically added up by a multiport photodetector (PD) during the photoelectric conversion process, and the result of MVM can be acquired by reading the output currents of all PDs. (b) A slightly different version of the architecture for MVM. Directional couplers are used to couple an equal portion of light from the bus waveguide, which may lead to a more compact chip size. (c) The architecture for GEMM. Input light at two different wavelengths (λ₁, λ₂) is modulated by two modulator arrays to encode the column vectors (x₁, x₂) of matrix X. The light encoded with corresponding elements in x₁ and x₂ is multiplexed into the same waveguide by a passive wavelength (de)multiplexer. Using the same splitting and guiding structure as in (a), the multi-wavelength optical signals are then simultaneously modulated by the modulator arrays that encode the row vectors (w₁, w₂, w₃, w₄) of matrix W. After wavelength demultiplexing, the output current of a multiport PD is proportional to w_i x_j (i ∈ {1, 2, 3, 4}, j ∈ {1, 2}) and therefore the result of GEMM can be acquired in the same way as MVM.


Figure 1(b) slightly differs from Fig. 1(a) in the way the light is split. Directional couplers are used to couple an equal portion of light from the bus waveguide [8,52,53], which may lead to a more compact chip size. In the matrix encoding region, each modulator array along the y direction encodes a row vector of the matrix and performs element-wise multiplications between the input column vector and the matrix row. Finally, these twice modulated optical signals are automatically added up by a multiport PD during the photoelectric conversion process, and the result of MVM can be acquired by reading the output currents of all PDs. Previously, a 4-port PD was demonstrated to improve the saturation power of the PD [54], but the idea of performing addition with a multiport PD has not been conceived yet.

By adding the wavelength degree of freedom to the architecture in Fig. 1(a), the architecture shown in Fig. 1(c) can perform the GEMM on a single chip. Figure 1(c) illustrates the multiplication between a 4 × 4 matrix ${\mathbf{W}}$ and a 4 × 2 matrix ${\mathbf{X}}$. Input light at two different wavelengths (${\lambda _1}$, ${\lambda _2}$), which can be generated by a compact integrated frequency comb [6], is modulated by two modulator arrays to encode the column vectors (${{\boldsymbol{x}}_1}$, ${{\boldsymbol{x}}_2}$) of ${\mathbf{X}}$. The light encoded with corresponding elements in ${{\boldsymbol{x}}_1}$ and ${{\boldsymbol{x}}_2}$ is multiplexed into the same waveguide by a passive wavelength (de)multiplexer, which can be implemented by an arrayed waveguide grating (AWG), multiple MRRs [31], an echelle grating [55], or a compact inversely designed component [56,57]. Using the same splitting and guiding structure as in Fig. 1(a), the multi-wavelength optical signals are then simultaneously modulated by the modulator arrays that encode the row vectors (${{\boldsymbol{w}}_1}$, ${{\boldsymbol{w}}_2}$, ${{\boldsymbol{w}}_3}$, ${{\boldsymbol{w}}_4}$) of matrix ${\mathbf{W}}$. After wavelength demultiplexing, the output current of a multiport PD is proportional to ${{\boldsymbol{w}}_{\boldsymbol{i}}}{{\boldsymbol{x}}_{\boldsymbol{j}}}$ (i ∈ {1, 2, 3, 4}, j ∈ {1, 2}) and therefore the result of GEMM can be acquired in the same way as MVM.
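The data flow described above can be sketched numerically. The following is a minimal, idealized model (lossless splitting, no crosstalk, intensities normalized to [0, 1]); the function names `multiport_pd_current` and `photonic_gemm` are hypothetical and chosen only for illustration. Each multiport PD is modeled as a plain sum of twice-modulated intensities, one PD per (matrix row, wavelength) pair.

```python
# Idealized sketch of the GEMM data flow in Fig. 1(c).
# Assumptions (not from the paper): lossless splitting, no crosstalk,
# and all intensities normalized to [0, 1].

def multiport_pd_current(weight_row, x_column):
    """One multiport PD: sums the twice-modulated intensities w_ik * x_kj."""
    return sum(w * x for w, x in zip(weight_row, x_column))

def photonic_gemm(W, X):
    """Return Y = W X, with one PD reading per (row i, wavelength j)."""
    n_wavelengths = len(X[0])
    # Column j of X is carried by wavelength j.
    columns = [[row[j] for row in X] for j in range(n_wavelengths)]
    return [[multiport_pd_current(w_row, col) for col in columns]
            for w_row in W]

W = [[0.5, 1.0], [0.0, 0.25]]   # weight matrix (rows w_i)
X = [[1.0, 0.0], [0.5, 1.0]]    # input matrix (columns x_j)
Y = photonic_gemm(W, X)
```

Setting the number of wavelengths to one reduces this sketch to the MVM case of Figs. 1(a),(b).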

Configuring the modulators in our architectures is straightforward. For MZI modulators, the power transmittance of an ideal single-input, single-output MZI is

(1)$$\frac{1}{2}({1 + \cos \theta } ),$$

where $\theta $ is the phase shift applied to one MZI arm. Therefore, a look-up table that directly maps a matrix/vector element to the applied voltage can be created for each phase shifter, which could be the III-V/Si hybrid metal-oxide-semiconductor (MOS) phase shifter or the microelectromechanical systems (MEMS) phase shifter [58–62]. Meanwhile, since only intensity modulation is needed, the modulators could also be tunable optical absorbers based on phase-change materials [6]. An analysis of the optical power loss is provided in Supplement 1. Since our architectures are noncoherent schemes based on intensity modulation, the modulation loss depends on the weight matrix and therefore is not a constant. Compared with the coherent linear optical processor, which does not generate modulation loss for unitary matrices, the modulation loss is in general higher in our architecture. As for the loss induced by waveguide propagation and passive components, the loss of our MVM architecture is of the same order of magnitude as that of the coherent linear optical processor within a reasonable device scale [see the details in Supplement 1].
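As an illustration of how such a look-up table might be built, the sketch below inverts Eq. (1) to obtain the phase for a normalized element. It stops at the phase: the phase-to-voltage curve is device-specific and is not modeled here. The function names and the choice of a uniform grid of element values are assumptions made for this example only.

```python
import math

def transmittance(theta):
    """Eq. (1): power transmittance of an ideal single-input, single-output MZI."""
    return 0.5 * (1.0 + math.cos(theta))

def phase_for_element(m):
    """Invert Eq. (1): phase in [0, pi] whose transmittance equals m."""
    if not 0.0 <= m <= 1.0:
        raise ValueError("element must be normalized to [0, 1]")
    return math.acos(2.0 * m - 1.0)

def build_lookup_table(bits):
    """Hypothetical table: phase for each of 2**bits evenly spaced element values."""
    levels = 2 ** bits
    return [phase_for_element(k / (levels - 1)) for k in range(levels)]

table = build_lookup_table(4)   # 16 entries, from m = 0 (theta = pi) to m = 1 (theta = 0)
```

A real table would add one more mapping per phase shifter, from phase to drive voltage, obtained by calibration.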

3. Multiport photodetector

On-chip multiport PDs are used to add up the intensity of multiple optical signals; the scalability, response speed, dark current, and port uniformity of the multiport PD are therefore vital to the system performance. In silicon photonics platforms, standard germanium (Ge) PDs can be adapted to fulfill the requirements. Figures 2(a),(b) show two conceived designs of the multiport PD, based on the evanescent-coupled vertical p-i-n structure [51]. In both designs, the waveguide gap gradually decreases to have a compact PD size. Strong coupling between adjacent waveguides occurs for a small waveguide gap (e.g., 200 nm), but since the coupling can be described by a unitary transformation, the total energy of input light is conserved and therefore the PD operation should be unaffected. Figure 2(a) shows a design that imposes minimal changes to the standard structure, which may be favored by small-scale devices. Figure 2(b) shows a symmetric design with superior scalability and minimal port difference. The maximum number of PD ports ${n_{\textrm{port}}}$ for the structure in Fig. 2(b) is given by

(2)$${n_{\textrm{port}}} = \left[ {\frac{{2\pi r}}{{w + g}}} \right], $$

where r is the radius of the circular p-doped Si region, w is the waveguide width, and g is the waveguide gap. When $w$ = 400 nm and $g$ = 200 nm, the PD can support more than 100 ports for r > 9.6 µm (corresponding to an area of 290 µm²). The scalability of both designs can be further improved by inversely tapering the waveguide width (e.g., to 300 nm) near the incident interface of the PD.
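The numbers quoted above can be checked with a few lines of arithmetic. This sketch assumes that the square brackets in Eq. (2) denote rounding down to an integer, which the text does not state explicitly; the function names are hypothetical.

```python
import math

def max_ports(radius_um, width_um, gap_um):
    """Eq. (2), reading the brackets as a floor (assumption)."""
    return math.floor(2.0 * math.pi * radius_um / (width_um + gap_um))

def circular_area_um2(radius_um):
    """Area of the circular p-doped Si region in square micrometers."""
    return math.pi * radius_um ** 2

ports = max_ports(9.6, 0.4, 0.2)     # w = 400 nm, g = 200 nm, r = 9.6 um
area = circular_area_um2(9.6)        # close to the 290 um^2 quoted in the text
```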

Fig. 2. Conceived designs of the multiport PD based on the evanescent-coupled vertical p-i-n structure and an estimation of the PD performance. (a) A design that imposes minimal changes to the standard structure, which may be favored by small-scale devices. (b) A symmetric design with superior scalability and minimal port difference. (c) An estimation of the 3-dB bandwidth and the dark current for the design in (b), using the parameters listed in Table 1.


Table 1. Parameters used for estimating the PD performance


The bandwidth and dark current have a trade-off with the scalability. The bandwidth decreases with increasing PD area, as determined by the carrier transit time and the RC time constant. The dark current, which determines the minimum detectable optical power, increases with the PD area. Depending on the device size and bias voltage, state-of-the-art Ge PDs typically have a dark current within the range of 1 ∼ 100 nA [51,63,64]. By suppressing the surface leakage current with a thin GeO_x layer on the sidewall, an ultralow dark current density of 0.57 mA/cm² was demonstrated at -1 V reverse bias [65], corresponding to a 5.7 nA dark current for a detector area of 10³ µm². Figure 2(c) shows the estimation of 3 dB bandwidth and dark current for the design in Fig. 2(b), using the parameters listed in Table 1. While it is difficult to obtain high bandwidth, high responsivity, and low dark current simultaneously, for a reasonable number of PD ports (e.g., 64), a 3 dB bandwidth larger than 10 GHz and a dark current less than 10 nA can be achieved in principle. Note that in this estimation, the bandwidth is mainly limited by the RC time constant and therefore can be further enhanced by improving the design. If the number of PD ports becomes an issue in ultralarge-scale devices, we can reduce the number of ports in each PD and combine the output currents of multiple PDs together, as demonstrated in [66].
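The conversion from dark current density to dark current quoted above is a unit exercise, sketched below. Only the area scaling of the dark current is shown; the bandwidth estimate depends on the Table 1 parameters and is not reproduced. The helper name is hypothetical.

```python
UM2_PER_CM2 = 1e8  # 1 cm^2 = 1e8 um^2

def dark_current_nA(density_mA_per_cm2, area_um2):
    """Dark current in nA for a given current density and detector area."""
    area_cm2 = area_um2 / UM2_PER_CM2
    current_mA = density_mA_per_cm2 * area_cm2
    return current_mA * 1e6  # mA -> nA

# Values from the text: 0.57 mA/cm^2 over a 10^3 um^2 detector.
i_dark = dark_current_nA(0.57, 1e3)
```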

4. Hardware error

Hardware errors are inevitable in analog computing platforms. In the architectures for MVM [Figs. 1(a),(b)], hardware errors on the PIC primarily originate from the dark current of PDs, the error in phase shifts due to a finite quantization resolution, nonuniform characteristics of optical components such as the splitting ratio of optical splitters, and the unbalanced insertion loss. These errors also exist in previous architectures and an error correction method has been proposed for the UUMI [18]. Directional couplers are known to be sensitive to fabrication accuracy, whereas 1 × 2 multimode interference (MMI) splitters can be low-loss (< 0.1 dB), wideband, robust, and highly symmetric (power imbalance: < 0.1 dB) [51]. Therefore, we expect the architecture in Fig. 1(a) to have a smaller hardware error than that in Fig. 1(b). As a fair comparison, we analyze the phase quantization error and the splitter-induced error in the matrix encoding region for the architecture in Fig. 1(a) and compare the results with the UUMI. For a real-valued N × N target matrix ${\mathbf{M}}$ with all elements normalized between 0 and 1, the relative error $\varepsilon $ in the actual matrix ${\mathbf{W}}$ can be calculated from the Frobenius norm:

(3)$$\varepsilon = \frac{\left\|{{\mathbf{M}} - {\mathbf{W}}}\right\|}{\left\|{\mathbf{M}}\right\|} = \frac{{\sqrt {\mathop \sum \nolimits_{i,j} {{|{{m_{ij}} - {w_{ij}}} |}^2}} }}{{\sqrt {\mathop \sum \nolimits_{i,j} {{|{{m_{ij}}} |}^2}} }}\; ({\left\|{\mathbf{M}}\right\| > 0} ).$$

We can see that $\varepsilon $ strongly depends on ${\mathbf{M}}$ because $\left\|{\mathbf{M}}\right\|$ is a variable here ($\left\|{\mathbf{M}}\right\| \in [{0,\; N} ]$). For the UUMI, ${\mathbf{M}}$ is unitary and therefore $\left\|{\mathbf{M}}\right\|$ is a constant ($\left\|{\mathbf{M}}\right\| \equiv \sqrt N $).
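A minimal sketch of Eq. (3) on nested Python lists is given below, together with the two norm values mentioned above. The function names are hypothetical.

```python
import math

def frobenius_norm(A):
    """Square root of the sum of squared elements."""
    return math.sqrt(sum(a * a for row in A for a in row))

def relative_error(M, W):
    """Eq. (3): ||M - W|| / ||M|| with the Frobenius norm, requiring ||M|| > 0."""
    norm_m = frobenius_norm(M)
    if norm_m == 0.0:
        raise ValueError("target matrix must have a nonzero norm")
    diff = [[m - w for m, w in zip(row_m, row_w)] for row_m, row_w in zip(M, W)]
    return frobenius_norm(diff) / norm_m

N = 4
all_ones = [[1.0] * N for _ in range(N)]                                   # norm N (upper bound)
identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]  # unitary, norm sqrt(N)
```

The all-ones matrix attains the upper bound N, while the identity, as one example of a unitary matrix, has norm $\sqrt N $.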

For a single-input, single-output MZI with imperfect splitters, the power transmittance ${w_{ij}}$ is given by

(4)$$\begin{aligned}{w_{ij}} &= {\left| {\left( {\begin{array}{cc} {\sqrt {\frac{1}{2} + {\beta _{ij}}} }&{\sqrt {\frac{1}{2} - {\beta _{ij}}} } \end{array}} \right)\left( {\begin{array}{cc} {{e^{j{\theta _{ij}}}}}&0\\ 0&1 \end{array}} \right)\left( {\begin{array}{c} {\sqrt {\frac{1}{2} + {\alpha _{ij}}} }\\ {\sqrt {\frac{1}{2} - {\alpha _{ij}}} } \end{array}} \right)} \right|^2}\\&= \frac{1}{2} + 2{\alpha _{ij}}{\beta _{ij}} + 2\cos {\theta _{ij}}\sqrt {\left( {\frac{1}{4} - \alpha_{ij}^2} \right)\left( {\frac{1}{4} - \beta_{ij}^2} \right)} ,\end{aligned}$$

where ${\alpha _{ij}}$ and ${\beta _{ij}}$ represent the deviations of splitting ratio from ideal value (50:50). When only considering the phase quantization error ${\varepsilon _\mathrm{\theta }}$ (assume ${\alpha _{ij}},{\beta _{ij}} = 0$), Eq. (3) becomes

(5)$${\varepsilon _\mathrm{\theta }} = \frac{{\sqrt {\mathop \sum \nolimits_{i,j} {{\left[ {\frac{1}{2}({\cos \theta_{ij}^{\prime} - \cos {\theta_{ij}}} )} \right]}^2}} }}{{\sqrt {\mathop \sum \nolimits_{i,j} {{\left[ {\frac{1}{2}({1 + \cos \theta_{ij}^{\prime}} )} \right]}^2}} }} \approx \sqrt {\frac{{\mathop \sum \nolimits_{ij} \Delta \theta _{ij}^2{{\sin }^2}\theta _{ij}^{\prime}}}{{\mathop \sum \nolimits_{ij} {{({1 + \cos \theta_{ij}^{\prime}} )}^2}}}} \; ({\left\|{\mathbf{M}}\right\| > 0} ),$$

where $\theta _{ij}^{\prime}$ represents the perfect phase for ${m_{ij}}$ and $\Delta {\theta _{ij}} = \theta _{ij}^{\prime} - {\theta _{ij}}$. Intuitively, ${\varepsilon _\mathrm{\theta }}^2$ can be understood as a weighted average of the error at each phase shifter. For a sufficiently large number of instances, the average ${\varepsilon _\mathrm{\theta }}$ should be independent of N and only determined by the phase quantization level. Meanwhile, the variance of ${\varepsilon _\mathrm{\theta }}$, which represents the difference between a sample mean and the expected value, should decrease when the sample size (${N^2}$) increases. Figure 3(a) shows the histograms of ${\varepsilon _\mathrm{\theta }}$ at various phase quantization levels and matrix scales, assuming no splitter-induced errors. In each case, we use 2500 randomly generated matrices as the target matrices (${\mathbf{M}}$), where each matrix element is sampled from a uniform distribution in [0, 1]. We can see that the average ${\varepsilon _\mathrm{\theta }}$ is almost unchanged while the variance decreases when N is increased from 4 to 64, which verifies our intuitive guess. For the splitter-induced error ${\varepsilon _\textrm{s}}$ (assume no phase errors), since a single MZI induces an error:

(6)$$|{{m_{ij}} - {w_{ij}}} |= \left|{2{\alpha_{ij}}{\beta_{ij}} + \cos {\theta_{ij}}\left( {2\sqrt {\left( {\frac{1}{4} - \alpha_{ij}^2} \right)\left( {\frac{1}{4} - \beta_{ij}^2} \right)} - \frac{1}{2}} \right)} \right|,$$

${\varepsilon _\textrm{s}}$ can be obtained by substituting Eq. (6) into Eq. (3). Figure 3(b) shows the histograms of ${\varepsilon _\textrm{s}}$ at various deviation levels in the splitting ratio and matrix scales, assuming no phase quantization errors. Here, ${\alpha _{ij}}$ and ${\beta _{ij}}$ are assumed to be independent and to follow the same normal distribution $\mathrm{{\cal N}}({0,{\sigma^2}} )$ ($\sigma $ is the standard deviation). We can see that the average and variance of ${\varepsilon _\textrm{s}}$ behave similarly to ${\varepsilon _\mathrm{\theta }}$ when N is varied.
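The Monte Carlo procedure described above can be sketched as follows. This is not the authors' simulation code: the uniform phase grid over [0, π], the instance count, and all function names are assumptions made for illustration. Each MZI is evaluated with Eq. (4), and the errors are accumulated according to Eq. (3).

```python
import math
import random

def mzi_transmittance(theta, alpha=0.0, beta=0.0):
    """Eq. (4): MZI power transmittance with splitting-ratio deviations."""
    root = math.sqrt((0.25 - alpha ** 2) * (0.25 - beta ** 2))
    return 0.5 + 2.0 * alpha * beta + 2.0 * math.cos(theta) * root

def quantize_phase(theta, bits):
    """Round theta to a uniform grid over [0, pi] (assumed quantization scheme)."""
    step = math.pi / (2 ** bits - 1)
    return round(theta / step) * step

def hardware_error(n, bits=None, sigma=0.0, rng=random):
    """Relative error, Eq. (3), for one random n x n target with elements in [0, 1]."""
    num = den = 0.0
    for _ in range(n * n):
        m = rng.random()                    # target element m_ij
        theta = math.acos(2.0 * m - 1.0)    # ideal phase from Eq. (1)
        if bits is not None:
            theta = quantize_phase(theta, bits)
        alpha = rng.gauss(0.0, sigma)       # deviations of the two splitters
        beta = rng.gauss(0.0, sigma)
        w = mzi_transmittance(theta, alpha, beta)
        num += (m - w) ** 2
        den += m * m
    return math.sqrt(num / den)

def mean_error(n, instances, **kwargs):
    """Average of hardware_error over several random target matrices."""
    return sum(hardware_error(n, **kwargs) for _ in range(instances)) / instances

rng = random.Random(0)
eps_theta = mean_error(8, 50, bits=10, rng=rng)     # phase quantization only
eps_split = mean_error(8, 50, sigma=0.01, rng=rng)  # splitter deviations only
```

Because each error stays within its own MZI, every matrix element contributes independently to the sums, which is why the mean is expected to be insensitive to N in this model.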

Fig. 3. (a) Phase quantization errors of 2500 instances at various quantization levels (10, 12, 14 bit) and matrix scales, assuming no splitter-induced errors. (b) Splitter-induced errors of 2500 instances at various deviation levels in the splitting ratio and matrix scales, assuming no phase quantization errors. The deviation of each splitting ratio is sampled from the normal distribution $\mathrm{{\cal N}}({0,{\sigma^2}} )$, where σ is the standard deviation. (c) Hardware errors of this architecture and the universal unitary multiport interferometer (UUMI). Each point represents the mean error of 2500 instances, and the error band indicates the range between the minimum and maximum error.


For the UUMI, it has been shown in [18] that both ${\varepsilon _\mathrm{\theta }}$ and ${\varepsilon _\textrm{s}}$ scale in proportion to $\sqrt N $. In Fig. 3(c), we plot the hardware errors of our architecture and the UUMI as a function of the number of MZIs, where each point represents the mean error of 2500 instances and the error band indicates the range between the minimum and maximum error. For the UUMI, 2500 randomly generated unitary matrices are used as the target matrices for each device scale. Our MVM architecture clearly has an intrinsically smaller hardware error than the UUMI, and the error does not increase with the device scale. This can be intuitively understood as follows: in our architecture, an error only affects a single MZI (and thus a single matrix element), whereas in the UUMI, an error induced by one MZI affects all the following MZI stages.

The unbalanced insertion loss between different waveguides can bring additional error. In our architectures, we can either equalize the path lengths by adding additional lengths to shorter paths or adjust the transimpedance gain of each PD to minimize the unbalanced-loss-induced error (denoted as ${\varepsilon _\mathrm{\alpha }}$). Here, we assume that the transmittance of each path ${t_{ij}}$ follows the normal distribution $\mathrm{{\cal N}}({1,\sigma_\mathrm{\alpha }^2} )$ to exclude the average attenuation factor, where ${\sigma _\mathrm{\alpha }}$ represents the relative standard deviation. Then each element of the actual matrix ${w_{ij}}$ is simply given by ${t_{ij}}{m_{ij}}$. Using the same 2500 random instances of W, ${\varepsilon _\mathrm{\alpha }}$ at various levels of ${\sigma _\mathrm{\alpha }}$ are calculated and shown in Fig. 4(a). If ${\sigma _\mathrm{\alpha }}$ is relatively large, ${\varepsilon _\mathrm{\alpha }}$ may exceed the phase quantization error and the splitter-induced error and become the largest error component in the MVM architecture.
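The loss-imbalance model above lends itself to a short numerical sketch. Under the stated assumption that each actual element is $t_{ij} m_{ij}$ with $t_{ij}$ drawn from a normal distribution centered at 1, the relative error of Eq. (3) can be estimated as below. The function name, matrix size, and seed are illustrative choices, not values from the paper.

```python
import math
import random

def loss_imbalance_error(n, sigma_alpha, rng):
    """Eq. (3) with w_ij = t_ij * m_ij and t_ij ~ N(1, sigma_alpha^2)."""
    num = den = 0.0
    for _ in range(n * n):
        m = rng.random()                  # target element in [0, 1]
        t = rng.gauss(1.0, sigma_alpha)   # relative path transmittance
        num += (m - t * m) ** 2
        den += m * m
    return math.sqrt(num / den)

rng = random.Random(1)
eps_alpha = loss_imbalance_error(64, 0.01, rng)
```

Since $t_{ij}$ is independent of $m_{ij}$ in this model, the estimate is expected to land close to ${\sigma _\mathrm{\alpha }}$ itself for large matrices.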

Fig. 4. (a) Unbalanced-loss-induced error ε_α in the MVM architecture at various levels of loss imbalance. Each point represents the mean value of 2500 instances (2500 random W). (b) Wavelength-induced error ε_λ in the GEMM architecture at various crosstalk levels. Each point represents the mean value of 2500 instances (50 random W × 50 random X), where each element in W and X is randomly sampled from the uniform distribution in [0, 1].


As for the architecture for GEMM [Fig. 1(c)], an additional error is the wavelength-induced error ${\varepsilon _\mathrm{\lambda }}$, which can occur at the modulators for matrix W and the following wavelength demultiplexers. The wavelength dependency of an optical modulator can be sufficiently small within a wide wavelength range, if MEMS or hybrid MOS phase shifters are used. Therefore, the wavelength demultiplexer should be a larger error source than the optical modulator. In a wavelength demultiplexer, one wavelength channel tends to have a similar level of crosstalk to all other channels, so the wavelength-induced error increases with more wavelengths (the column number of ${\mathbf{X}}$). An optical bandpass filter may be needed to reduce the wavelength crosstalk. Assuming broadband optical modulators with a negligible wavelength dependency are used, performing ${{\mathbf{W}}_{{\boldsymbol{N}} \times {\boldsymbol{N}}}}{{\mathbf{X}}_{{\boldsymbol{N}} \times {\boldsymbol{M}}}}$ with this architecture yields the result ${{\mathbf{Y}}_{{\boldsymbol{N}} \times {\boldsymbol{M}}}}$, where

(7)$${y_{ij}} = {{\boldsymbol{w}}_{\boldsymbol{i}}}{{\boldsymbol{x}}_{\boldsymbol{j}}} + \mathop \sum \limits_{m = 1({m \ne j} )}^M {\kappa _{jm}}{{\boldsymbol{w}}_{\boldsymbol{i}}}{{\boldsymbol{x}}_{\boldsymbol{m}}}.$$

Here, M represents the number of wavelengths, ${{\boldsymbol{w}}_{\boldsymbol{i}}}$ is the i-th row vector of ${{\mathbf{W}}_{{\boldsymbol{N}} \times {\boldsymbol{N}}}}$, ${{\boldsymbol{x}}_j}$ is the j-th column vector of ${{\mathbf{X}}_{{\boldsymbol{N}} \times {\boldsymbol{M}}}}$, and ${\kappa _{jm}}$ represents the crosstalk from the wavelength channel m to j. Then, the wavelength-induced error ${\varepsilon _\mathrm{\lambda }}$ can be calculated by substituting Eq. (7) into

(8)$$\varepsilon = \frac{{{||\mathbf{WX}} - {\mathbf{Y}||}}}{{||{\mathbf{WX}||}}}.$$

For simplicity, we can assume that ${\kappa _{jm\; ({m \ne j} )}} = \kappa $ for all wavelength channels, then ${\varepsilon _\mathrm{\lambda }}$ simplifies into

(9)$${\varepsilon _\mathrm{\lambda }} = \kappa \sqrt {\frac{{\mathop \sum \nolimits_{i,j} {{\left( {\mathop \sum \nolimits_{m = 1}^M {{\boldsymbol{w}}_{\boldsymbol{i}}}{{\boldsymbol{x}}_{\boldsymbol{m}}} - {{\boldsymbol{w}}_{\boldsymbol{i}}}{{\boldsymbol{x}}_{\boldsymbol{j}}}} \right)}^2}}}{{\mathop \sum \nolimits_{i,j} {{({{{\boldsymbol{w}}_{\boldsymbol{i}}}{{\boldsymbol{x}}_{\boldsymbol{j}}}} )}^2}}}} .$$

Figure 4(b) shows the calculated ${\varepsilon _\mathrm{\lambda }}$ at various crosstalk levels. Each point represents the mean value of 2500 instances (50 random ${\mathbf{W}}$ × 50 random ${\mathbf{X}}$), where each element in ${\mathbf{W}}$ and ${\mathbf{X}}$ is randomly sampled from the uniform distribution in [0, 1]. We can see that the wavelength-induced error can easily exceed the phase quantization error and splitter-induced error, becoming the largest error source. Therefore, it is vital to reduce the crosstalk between different wavelength channels to have a small hardware error. A relevant analysis on the wavelength-induced error is also given in [68].
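Equations (7) and (8) can be evaluated directly for a uniform crosstalk level κ, as in the sketch below. The helper names and the small test matrices are hypothetical; the model assumes wavelength-independent modulators, as in the text.

```python
import math

def matmul(A, B):
    """Plain matrix product of nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def crosstalk_output(W, X, kappa):
    """Eq. (7) with kappa_jm = kappa for all m != j."""
    ideal = matmul(W, X)
    return [[p + kappa * (sum(row) - p) for p in row] for row in ideal]

def wavelength_error(W, X, kappa):
    """Eq. (8): relative Frobenius error between W X and the crosstalk-affected Y."""
    ideal = matmul(W, X)
    actual = crosstalk_output(W, X, kappa)
    num = sum((p - y) ** 2 for ri, ra in zip(ideal, actual) for p, y in zip(ri, ra))
    den = sum(p * p for row in ideal for p in row)
    return math.sqrt(num / den)

W = [[0.2, 0.4], [0.6, 0.8]]
X = [[0.5, 0.5], [1.0, 1.0]]   # two identical columns (two wavelengths)
eps_lambda = wavelength_error(W, X, 0.01)
```

With two identical input columns, the leaked signal equals κ times the wanted signal, so the error in this toy case reduces to κ; Eq. (9) makes the same linear dependence on κ explicit for general inputs.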

5. Energy efficiency of GEMM

An obvious advantage of the GEMM is the throughput since multiple vectors can be processed simultaneously. In addition, the advantage of GEMM over parallel and sequential MVMs can be seen by comparing the operations per second per watt (OPS/W), which is a common measure for energy efficiency that considers both the throughput and the power consumption. Here, the number of operations is considered to be the same as the total number of input vectors. Therefore, within one cycle, the sequential MVM performs one operation, while the parallel MVM and the GEMM perform multiple operations. For the calculation of ${{\mathbf{W}}_{N \times N}}{{\mathbf{X}}_{N \times M}}$, our architecture for GEMM requires ${N^2} + NM$ optical modulators and $NM$ multiport PDs, while architectures for MVM require ${N^2}$ optical modulators for matrix encoding, N optical modulators for vector encoding, and N PDs (implemented M times). Note that in the architectures for MVM, the matrix-encoding modulators can work at a lower frequency (1/M) than the vector-encoding modulators. Therefore, the digital-to-analog converters (DACs) for matrix-encoding modulators can operate at a slower update rate than the DACs for vector-encoding modulators. Supposing that each modulator is driven by an individual DAC and each PD is read out by an individual analog-to-digital converter (ADC), the improvement factor in OPS/W of the architecture for GEMM can be approximated by

(10)$$\eta = \frac{{M{\textrm{P}_{\textrm{MVM}}}}}{{{\textrm{P}_{\textrm{GEMM}}}}} \approx M\frac{{{N^2}{\textrm{P}_{\textrm{DAC} - \textrm{l}}} + N{\textrm{P}_{\textrm{DAC} - \textrm{h}}} + N{\textrm{P}_{\textrm{ADC}}} + N{\textrm{P}_{\textrm{ph}}}}}{{({{N^2} + NM} ){\textrm{P}_{\textrm{DAC} - \textrm{h}}} + NM{\textrm{P}_{\textrm{ADC}}} + NM\textrm{P}_{\textrm{ph}}^{\prime}}},$$

where ${\textrm{P}_{\textrm{MVM}}}$, ${\textrm{P}_{\textrm{GEMM}}}$, ${\textrm{P}_{\textrm{DAC} - \textrm{h}}}$, ${\textrm{P}_{\textrm{DAC} - \textrm{l}}}$, and ${\textrm{P}_{\textrm{ADC}}}$ represent the power consumption of an MVM device, a GEMM device, a DAC operating at the higher update rate (${f_\textrm{r}}$), a DAC operating at the lower update rate (${f_\textrm{r}}/M$), and an ADC operating at the sampling rate of ${f_\textrm{r}}$, respectively, and ${\textrm{P}_{\textrm{ph}}}$ and $\textrm{P}_{\textrm{ph}}^{\prime}$ represent the required optical power per PD in an MVM and a GEMM device, respectively. Since the dominant power consumption comes from the electronics, we first assume an ideal situation in which the insertion loss of the wavelength multiplexer is ignored. Using the parameters listed in Table 2 (see more details on the power consumption of DACs and ADCs in Supplement 1), $\eta $ is calculated under various conditions and shown in Fig. 5(a). Moreover, for batch processing in deep learning, the batch size B, which represents the total number of vectors to be processed, can be set larger than M to further improve the energy efficiency, since the DACs for matrix ${\mathbf{W}}$ can now operate at a lower update rate (${f_\textrm{r}}M/B$). In Fig. 5(b), we show the influence of the batch size on $\eta $ in a more realistic situation, in which the upper limit of N is set to 64 and a 4-wavelength multiplexer with an insertion loss of 1.5 dB is used [55]. The optical power in the GEMM device is increased accordingly to compensate for the extra 3 dB power loss, which presumably corresponds to 1.5 dB each for multiplexing and demultiplexing. As shown in Fig. 5(b), with 4 wavelengths, it is possible to obtain an improvement factor greater than 2 for N > 20.
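Equation (10) is straightforward to evaluate once the component powers are known. The sketch below implements the formula as written; the numerical values are arbitrary placeholders in arbitrary units and are not the Table 2 parameters, which are not reproduced in this text. The function and argument names are hypothetical.

```python
def improvement_factor(n, m, p_dac_h, p_dac_l, p_adc, p_ph, p_ph_gemm):
    """Eq. (10): eta = M * P_MVM / P_GEMM for W (n x n) times X (n x m)."""
    p_mvm = n * n * p_dac_l + n * p_dac_h + n * p_adc + n * p_ph
    p_gemm = (n * n + n * m) * p_dac_h + n * m * p_adc + n * m * p_ph_gemm
    return m * p_mvm / p_gemm

# Placeholder powers (arbitrary units, NOT taken from Table 2).
P_DAC_H, P_ADC, P_PH = 1.0, 0.5, 0.125

# Case 1: the slow DAC consumes as much as the fast DAC.
eta_flat = improvement_factor(16, 4, P_DAC_H, P_DAC_H, P_ADC, P_PH, P_PH)

# Case 2: DAC power strictly proportional to update rate (P_l = P_h / M).
eta_linear = improvement_factor(16, 4, P_DAC_H, P_DAC_H / 4, P_ADC, P_PH, P_PH)
```

One observation that follows from the formula itself: if the DAC power were strictly proportional to the update rate and the optical power per PD were equal in both devices, Eq. (10) would give η = 1 exactly. Any gain predicted by Eq. (10) therefore relies on the low-rate DACs consuming more than 1/M of the high-rate DAC power.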

Fig. 5. Improvement factor of GEMM over parallel and sequential MVMs with respect to operations per second per watt (OPS/W). (a) Ideal situations where the insertion loss of wavelength multiplexers is ignored. The batch size is equal to M. (b) A more realistic situation where 4-wavelength multiplexers with an insertion loss of 1.5 dB are used. The optical power is increased accordingly to compensate for the extra loss of wavelength multiplexers. The batch size B is an integer multiple of M.

Table 2. Parameters used in the calculation of η

Our conclusion that GEMM has a higher energy efficiency than MVMs also applies to many other schemes, such as the coherent linear optical processor, provided that they are integrated with crossing-free on-chip wavelength multiplexers using the two-layer waveguide structure proposed here. According to the analysis in a previous work [7], the throughput and power consumption are mainly limited by electronics such as DACs and ADCs. For the same matrix scale, our MVM architecture requires the same number of DACs and ADCs as the coherent linear optical processor. Therefore, the throughput and the power consumption of the electronics should be at the same level for both architectures. While our architecture may require a higher optical power due to a higher modulation loss for some matrices, the associated decrease in energy efficiency is slight because the dominant power consumption comes from the electronics.

6. Discussion

Large-scale matrix multiplications require a significant number of optical modulators, especially for GEMM. Although 64 × 64 devices for MVM have been demonstrated recently [14], further scaling up is still challenging. In our architectures, this issue can be alleviated by dividing the device into multiple modules and placing each module on a single die. For example, in the architecture for GEMM, one or several optical modulator arrays for matrix ${\mathbf{W}}$ and the associated multiport PDs can form a module, which receives the optical signals from the module for matrix ${\mathbf{X}}$ via low-cost passive optical interconnects, such as the alignment-free photonic interconnect [69]. The optical insertion loss slightly increases in such a multi-module device, but since the optical power is negligible compared with the electronic power [70], the decrease in overall energy efficiency is insignificant. Note that the same method is difficult to apply to coherent linear optical processors because of the difficulty in controlling the phase change in the interconnects. From the viewpoint of signal-to-noise ratio (SNR), the analysis in Supplement 1 shows that a 64 × 64 multi-module GEMM device using 4 wavelengths is possible. For larger-scale devices, the insertion loss needs to be further reduced to obtain reasonable SNRs.

Compared with the coherent linear optical processor, which supports complex numbers, a drawback of our scheme is that, in its current form, only non-negative real numbers (normalized within [0, 1]) can be implemented as matrix/vector elements. However, this may not be a serious problem, since most existing coherent linear optical processors are only used for real-valued MVMs. In our architectures, negative real numbers can also be implemented by slightly adapting the original structure. Figure 6 shows an adapted modulator array for one matrix row, where ${w_{ij}}$ now represents the transmittance from the MZI input to one output port, so that, for a lossless MZI, the other output port carries the remaining fraction $1 - {w_{ij}}$. Subtracting the output currents of the two multiport PDs yields

(11)$${\textrm{I}_1} - {\textrm{I}_2} = \mathop \sum \limits_{j = 0}^N {x_j}({2{w_{ij}} - 1} ).$$

Since ${w_{ij}} \in [{0,1} ]$, real numbers between -1 and 1 can now be implemented by $2{w_{ij}} - 1$. This adaptation also eliminates the modulation loss caused by matrix ${\mathbf{W}}$.
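The mapping in Eq. (11) can be checked numerically with a short Python sketch. It assumes an ideal lossless MZI whose two output ports carry the fractions ${w_{ij}}$ and $1 - {w_{ij}}$ of the input power, and PDs with unit responsivity; the function names are hypothetical and chosen only for this illustration.

```python
def encode_signed_weight(s):
    """Map a signed weight s in [-1, 1] to a transmittance w in [0, 1],
    so that 2*w - 1 == s."""
    if not -1.0 <= s <= 1.0:
        raise ValueError("signed weight must lie in [-1, 1]")
    return (s + 1.0) / 2.0


def differential_row_output(x, signed_row):
    """Return I1 - I2 for one matrix row.

    I1 sums the power from the MZI ports with transmittance w_ij,
    I2 sums the power from the complementary ports (1 - w_ij).
    Assumes lossless MZIs and unit-responsivity PDs.
    """
    w = [encode_signed_weight(s) for s in signed_row]
    i1 = sum(xj * wj for xj, wj in zip(x, w))
    i2 = sum(xj * (1.0 - wj) for xj, wj in zip(x, w))
    return i1 - i2


if __name__ == "__main__":
    x = [0.2, 0.5, 1.0]              # non-negative input vector elements
    signed_row = [-1.0, 0.5, 0.25]   # desired signed weights for one row
    direct = sum(xj * sj for xj, sj in zip(x, signed_row))
    print(differential_row_output(x, signed_row), direct)
```

In this idealized model, all the light of each input reaches one of the two PDs, which is consistent with the statement that the adaptation removes the modulation loss caused by ${\mathbf{W}}$.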

Fig. 6. Adapting the original structure to implement negative real numbers. Twice as many multiport PDs are needed for this purpose.

7. Conclusion

We have proposed novel integrated photonic architectures for MVM and GEMM based on a two-layer waveguide platform. Compared with previous architectures for MVM, our architecture has an intrinsically smaller hardware error, and the error does not increase with the device scale, which is crucial for large-scale matrix multiplications. The architecture for GEMM allows GEMM to be performed directly on a photonic chip, with a high energy efficiency unattainable by parallel or sequential MVMs. This work provides a promising approach to realizing a high-fidelity and energy-efficient optical computing platform.

Funding

Japan Science and Technology Agency (JST) CREST (JPMJCR2004); Japan Society for the Promotion of Science (JSPS) KAKENHI (22K14298).

Acknowledgments

R. Tang thanks Ziqiang Zhao, Hanzhi Tang, and Yuto Miyatake for fruitful discussions.

Disclosures

The authors are applying for a patent relating to the content of this paper.

Data availability

Data underlying the results presented in this paper are available from the corresponding authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

References

1. E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, “Accelerating binarized neural networks: comparison of FPGA, CPU, GPU, and ASIC,” in 2016 International Conference on Field-Programmable Technology (FPT) (IEEE, 2016), pp. 77–84.

2. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, “Deep learning with coherent nanophotonic circuits,” Nat. Photonics 11(7), 441–446 (2017).

3. T. F. de Lima, H.-T. Peng, A. N. Tait, M. A. Nahmias, H. B. Miller, B. J. Shastri, and P. R. Prucnal, “Machine learning with neuromorphic photonics,” J. Lightwave Technol. 37(5), 1515–1534 (2019).

4. R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, “Large-scale optical neural networks based on photoelectric multiplication,” Phys. Rev. X 9(2), 021032 (2019).

5. G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljačić, C. Denz, D. A. B. Miller, and D. Psaltis, “Inference in artificial intelligence with deep optics and photonics,” Nature 588(7836), 39–47 (2020).

6. J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, and H. Bhaskaran, “Parallel convolutional processing using an integrated photonic tensor core,” Nature 589(7840), 52–58 (2021).

7. C. Demirkiran, F. Eris, G. Wang, J. Elmhurst, N. Moore, N. C. Harris, A. Basumallik, V. J. Reddi, A. Joshi, and D. Bunandar, “An electro-photonic system for accelerating deep neural networks,” arXiv:2109.01126 [cs] (2021).

8. S. Xu, J. Wang, H. Shu, Z. Zhang, S. Yi, B. Bai, X. Wang, J. Liu, and W. Zou, “Optical coherent dot-product chip for sophisticated deep learning regression,” Light: Sci. Appl. 10(1), 221 (2021).

9. H. Zhou, J. Dong, J. Cheng, W. Dong, C. Huang, Y. Shen, Q. Zhang, M. Gu, C. Qian, H. Chen, Z. Ruan, and X. Zhang, “Photonic matrix multiplication lights up photonic accelerator and beyond,” Light: Sci. Appl. 11(1), 30 (2022).

10. C. Taballione, T. A. W. Wolterink, J. Lugani, A. Eckstein, B. A. Bell, R. Grootjans, I. Visscher, D. Geskus, C. G. H. Roeloffzen, J. J. Renema, I. A. Walmsley, P. W. H. Pinkse, and K.-J. Boller, “8×8 reconfigurable quantum photonic processor based on silicon nitride waveguides,” Opt. Express 27(19), 26842 (2019).

11. S. Pai, B. Bartlett, O. Solgaard, and D. A. B. Miller, “Matrix optimization on universal unitary photonic devices,” Phys. Rev. Appl. 11(6), 064044 (2019).

12. R. Tanomura, R. Tang, S. Ghosh, T. Tanemura, and Y. Nakano, “Robust integrated optical unitary converter using multiport directional couplers,” J. Lightwave Technol. 38(1), 60–66 (2020).

13. F. Shokraneh, S. Geoffroy-Gagnon, and O. Liboiron-Ladouceur, “The diamond mesh, a phase-error- and loss-tolerant field-programmable MZI-based optical processor for optical neural networks,” Opt. Express 28(16), 23495 (2020).

14. C. Ramey, “Silicon photonics for artificial intelligence acceleration: HotChips 32,” in 2020 IEEE Hot Chips 32 Symposium (HCS) (IEEE, 2020), pp. 1–26.

15. C. Taballione, R. van der Meer, H. J. Snijders, P. Hooijschuur, J. P. Epping, M. de Goede, B. Kassenberg, P. Venderbosch, C. Toebes, H. van den Vlekkert, P. W. H. Pinkse, and J. J. Renema, “A universal fully reconfigurable 12-mode quantum photonic processor,” Mater. Quantum Technol. 1(3), 035002 (2021).

16. H. Zhang, M. Gu, X. D. Jiang, J. Thompson, H. Cai, S. Paesani, R. Santagati, A. Laing, Y. Zhang, M. H. Yung, Y. Z. Shi, F. K. Muhammad, G. Q. Lo, X. S. Luo, B. Dong, D. L. Kwong, L. C. Kwek, and A. Q. Liu, “An optical neural chip for implementing complex-valued neural network,” Nat. Commun. 12(1), 457 (2021).

17. R. Tang, R. Tanomura, T. Tanemura, and Y. Nakano, “Ten-port unitary optical processor on a silicon photonic chip,” ACS Photonics 8(7), 2074–2080 (2021).

18. S. Bandyopadhyay, R. Hamerly, and D. Englund, “Hardware error correction for programmable photonics,” Optica 8(10), 1247 (2021).

19. R. Hamerly, S. Bandyopadhyay, and D. Englund, “Accurate self-configuration of rectangular multiport interferometers,” Phys. Rev. Appl. 18(2), 024019 (2022).

20. R. Hamerly, S. Bandyopadhyay, and D. Englund, “Stability of self-configuring large multiport interferometers,” Phys. Rev. Appl. 18(2), 024018 (2022).

21. R. Hamerly, S. Bandyopadhyay, and D. Englund, “Infinitely scalable multiport interferometers,” arXiv:2109.05367 [physics] (2021).

22. R. Tanomura, R. Tang, T. Umezaki, G. Soma, T. Tanemura, and Y. Nakano, “Scalable and robust photonic integrated unitary converter based on multiplane light conversion,” Phys. Rev. Appl. 17(2), 024071 (2022).

23. D. A. B. Miller, “Self-configuring universal linear optical component,” Photonics Res. 1(1), 1 (2013).

24. D. A. B. Miller, “Waves, modes, communications, and optics: a tutorial,” Adv. Opt. Photonics 11(3), 679 (2019).

25. M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, “Experimental realization of any discrete unitary operator,” Phys. Rev. Lett. 73(1), 58–61 (1994).

26. J. Carolan, C. Harrold, C. Sparrow, E. Martín-López, N. J. Russell, J. W. Silverstone, P. J. Shadbolt, N. Matsuda, M. Oguma, M. Itoh, G. D. Marshall, M. G. Thompson, J. C. F. Matthews, T. Hashimoto, J. L. O’Brien, and A. Laing, “Universal linear optics,” Science 349(6249), 711–716 (2015).

27. W. R. Clements, P. C. Humphreys, B. J. Metcalf, W. S. Kolthammer, and I. A. Walmsley, “Optimal design for universal multiport interferometers,” Optica 3(12), 1460 (2016).

28. M. Moralis-Pegios, G. Mourgias-Alexandris, A. Tsakyridis, G. Giamougiannis, A. Totovic, G. Dabos, N. Passalis, M. Kirtas, T. Rutirawut, F. Y. Gardes, A. Tefas, and N. Pleros, “Neuromorphic silicon photonics and hardware-aware deep learning for high-speed inference,” J. Lightwave Technol. 40(10), 3243–3254 (2022).

29. G. Dabos, D. V. Bellas, R. Stabile, M. Moralis-Pegios, G. Giamougiannis, A. Tsakyridis, A. Totovic, E. Lidorikis, and N. Pleros, “Neuromorphic photonic technologies and architectures: scaling opportunities and performance frontiers [Invited],” Opt. Mater. Express 12(6), 2343 (2022).

30. A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Broadcast and weight: an integrated network for scalable photonic spike processing,” J. Lightwave Technol. 32(21), 4029–4041 (2014).

31. A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shastri, M. A. Nahmias, and P. R. Prucnal, “Microring weight banks,” IEEE J. Sel. Top. Quantum Electron. 22(6), 312–325 (2016).

32. A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Neuromorphic photonic networks using silicon photonic weight banks,” Sci. Rep. 7(1), 7430 (2017).

33. A. N. Tait, H. Jayatilleka, T. F. De Lima, P. Y. Ma, M. A. Nahmias, B. J. Shastri, S. Shekhar, L. Chrostowski, and P. R. Prucnal, “Feedback control for microring weight banks,” Opt. Express 26(20), 26422 (2018).

34. S. Ohno, K. Toprasertpong, S. Takagi, and M. Takenaka, “Si microring resonator crossbar arrays for deep learning accelerator,” Jpn. J. Appl. Phys. 59(SG), SGGE04 (2020).

35. C. Huang, S. Bilodeau, T. Ferreira de Lima, A. N. Tait, P. Y. Ma, E. C. Blow, A. Jha, H.-T. Peng, B. J. Shastri, and P. R. Prucnal, “Demonstration of scalable microring weight bank control for large-scale photonic integrated circuits,” APL Photonics 5(4), 040803 (2020).

36. W. Zhang, C. Huang, H.-T. Peng, S. Bilodeau, A. Jha, E. Blow, T. F. de Lima, B. J. Shastri, and P. Prucnal, “Silicon microring synapses enable photonic deep learning beyond 9-bit precision,” Optica 9(5), 579 (2022).

37. S. Ohno, R. Tang, K. Toprasertpong, S. Takagi, and M. Takenaka, “Si microring resonator crossbar array for on-chip inference and training of the optical neural network,” ACS Photonics 9(8), 2614–2622 (2022).

38. D. J. M. Moss, S. Krishnan, E. Nurvitadhi, P. Ratuszniak, C. Johnson, J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. H. W. Leong, “A customizable matrix multiplication framework for the Intel HARPv2 Xeon + FPGA platform: a deep learning case study,” in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ACM, 2018), pp. 107–116.

39. L. Bernstein, A. Sludds, R. Hamerly, V. Sze, J. Emer, and D. Englund, “Freely scalable and reconfigurable optical hardware for deep learning,” Sci. Rep. 11(1), 3144 (2021).

40. A. R. Dias, “Incoherent optical matrix-matrix multiplier,” in NASA. Langley Research Center Opt. Inform. Process. for Aerospace Appl. (1981).

41. R. A. Athale and W. C. Collins, “Optical matrix–matrix multiplier based on outer product decomposition,” Appl. Opt. 21(12), 2089 (1982).

42. Y.-Z. Liang and H. K. Liu, “Optical matrix–matrix multiplication method demonstrated by the use of a multifocus hololens,” Opt. Lett. 9(8), 322 (1984).

43. B. H. Soffer, Y. Owechko, E. Marom, and J. Grinberg, “Programmable real-time incoherent matrix multiplier for optical processing,” Appl. Opt. 25(14), 2295 (1986).

44. A. Totovic, C. Pappas, M. Kirtas, A. Tsakyridis, G. Giamougiannis, N. Passalis, M. Moralis-Pegios, A. Tefas, and N. Pleros, “WDM equipped universal linear optics for programmable neuromorphic photonic processors,” Neuromorphic Comput. Eng. 2(2), 024010 (2022).

45. H. Li, W. Chen, P. Wang, S. Dai, Y. Liu, Q. Fu, J. Li, Y. Li, T. Dai, H. Yu, and J. Yang, “Compact and low-loss 1 × 3 polarization-insensitive optical power splitter using cascaded tapered silicon waveguides,” Opt. Lett. 45(19), 5596 (2020).

46. S. Wu, X. Mu, L. Cheng, S. Mao, and H. Y. Fu, “State-of-the-art and perspectives on silicon waveguide crossings: a review,” Micromachines 11(3), 326 (2020).

47. W. D. Sacher, J. C. Mikkelsen, P. Dumais, J. Jiang, D. Goodwill, X. Luo, Y. Huang, Y. Yang, A. Bois, P. G.-Q. Lo, E. Bernier, and J. K. S. Poon, “Tri-layer silicon nitride-on-silicon photonic platform for ultra-low-loss crossings and interlayer transitions,” Opt. Express 25(25), 30862 (2017).

48. Y. Zhang, A. Samanta, K. Shang, and S. J. B. Yoo, “Scalable 3D silicon photonic electronic integrated circuits and their applications,” IEEE J. Sel. Top. Quantum Electron. 26, 1–10 (2020).

49. N. M. Fahrenkopf, C. McDonough, G. L. Leake, Z. Su, E. Timurdogan, and D. D. Coolbaugh, “The AIM Photonics MPW: a highly accessible cutting edge technology for rapid prototyping of photonic integrated circuits,” IEEE J. Sel. Top. Quantum Electron. 25(5), 1–6 (2019).

50. K. Suzuki, S. Namiki, H. Kawashima, K. Ikeda, R. Konoike, N. Yokoyama, M. Seki, M. Ohtsuka, S. Saitoh, S. Suda, H. Matsuura, and K. Yamada, “Nonduplicate polarization-diversity 32 × 32 silicon photonics switch based on a SiN/Si double-layer platform,” J. Lightwave Technol. 38(2), 226–232 (2020).

51. S. Y. Siew, B. Li, F. Gao, H. Y. Zheng, W. Zhang, P. Guo, S. W. Xie, A. Song, B. Dong, L. W. Luo, C. Li, X. Luo, and G.-Q. Lo, “Review of silicon photonics technology and platform development,” J. Lightwave Technol. 39(13), 4374–4389 (2021).

52. J. Sun, E. Timurdogan, A. Yaacobi, E. S. Hosseini, and M. R. Watts, “Large-scale nanophotonic phased array,” Nature 493(7431), 195–199 (2013).

53. J. Chiles, S. M. Buckley, S. W. Nam, R. P. Mirin, and J. M. Shainline, “Design, fabrication, and metrology of 10 × 100 multi-planar integrated photonic routing manifolds for neural networks,” APL Photonics 3(10), 106101 (2018).

54. X. Hu, D. Wu, H. Zhang, W. Li, D. Chen, L. Wang, X. Xiao, and S. Yu, “High-speed lateral PIN germanium photodetector with 4-directional light input,” Opt. Express 28(25), 38343 (2020).

55. D. Melati, P. G. Verly, A. Delâge, S. Wang, J. Lapointe, P. Cheben, J. H. Schmid, S. Janz, and D.-X. Xu, “Compact and low crosstalk echelle grating demultiplexer on silicon-on-insulator technology,” Electronics 8(6), 687 (2019).

56. A. Y. Piggott, J. Lu, K. G. Lagoudakis, J. Petykiewicz, T. M. Babinec, and J. Vučković, “Inverse design and demonstration of a compact and broadband on-chip wavelength demultiplexer,” Nat. Photonics 9(6), 374–377 (2015).

57. L. Su, A. Y. Piggott, N. V. Sapra, J. Petykiewicz, and J. Vučković, “Inverse design and demonstration of a compact on-chip narrowband three-channel wavelength demultiplexer,” ACS Photonics 5(2), 301–305 (2018).

58. J.-H. Han, F. Boeuf, J. Fujikata, S. Takahashi, S. Takagi, and M. Takenaka, “Efficient low-loss InGaAsP/Si hybrid MOS optical modulator,” Nat. Photonics 11(8), 486–490 (2017).

59. M. Takenaka, S. Takahashi, S. Takagi, J.-H. Han, F. Boeuf, J.-K. Park, Q. Li, C. P. Ho, D. Lyu, S. Ohno, and J. Fujikata, “III–V/Si hybrid MOS optical phase shifter for Si photonic integrated circuits,” J. Lightwave Technol. 37(5), 1474–1483 (2019).

60. H. Sattari, T. Graziosi, M. Kiss, T. J. Seok, S. Han, M. C. Wu, and N. Quack, “Silicon photonic MEMS phase-shifter,” Opt. Express 27(13), 18959 (2019).

61. C. Errando-Herranz, A. Y. Takabayashi, P. Edinger, H. Sattari, K. B. Gylfason, and N. Quack, “MEMS for photonic integrated circuits,” IEEE J. Sel. Top. Quantum Electron. 26(2), 1–16 (2020).

62. W. Bogaerts, A. Y. Takabayashi, P. Edinger, I. Zand, G. Jo, H. Sattari, P. Verheyen, M. A. Jezzini, C. Antony, G. Talli, M. Saei, S. Kumar, C. L. Arce, M. G. Porcel, N. Quack, K. B. Gylfason, F. Niklaus, and U. Khan, “Programmable photonic circuits using silicon photonic MEMS,” in Advanced Photonics Congress (OSA, 2021), paper IM2A.1.

63. Z. Su, E. S. Hosseini, E. Timurdogan, J. Sun, M. Moresco, G. Leake, T. N. Adam, D. D. Coolbaugh, and M. R. Watts, “Whispering gallery germanium-on-silicon photodetector,” Opt. Lett. 42(15), 2878 (2017).

64. J. Kang, S. Takagi, and M. Takenaka, “Ge photodetector monolithically integrated with amorphous Si waveguide on wafer-bonded Ge-on-insulator substrate,” Opt. Express 26(23), 30546 (2018).

65. B. Son, Y. Lin, K. H. Lee, Y. Wang, S. Wu, and C. S. Tan, “High speed and ultra-low dark current Ge vertical p-i-n photodetectors on an oxygen-annealed Ge-on-insulator platform with GeOx surface passivation,” Opt. Express 28(16), 23978 (2020).

66. F. Ashtiani, A. J. Geers, and F. Aflatouni, “An on-chip photonic deep neural network for image classification,” Nature 606(7914), 501–506 (2022).

67. L. Colace, G. Assanto, D. Fulgoni, and L. Nash, “Near-infrared p-i-n Ge-on-Si photodiodes for silicon integrated receivers,” J. Lightwave Technol. 26(16), 2954–2959 (2008).

68. A. Totovic, G. Giamougiannis, A. Tsakyridis, D. Lazovsky, and N. Pleros, “Programmable photonic neural networks combining WDM with coherent linear optics,” Sci. Rep. 12(1), 5605 (2022).

69. S. Bandyopadhyay and D. Englund, “Alignment-free photonic interconnects,” arXiv:2110.12851 [physics] (2021).

70. A. N. Tait, “Quantifying power in silicon photonic neural networks,” Phys. Rev. Appl. 17(5), 054029 (2022).

Parameter	Description	Value	Reference
$w$	Waveguide width	400 nm
$g$	Waveguide gap	200 nm
$J_{dark}$	Dark current density	1 mA/µm²	[65,67]
$f_{T}$	Carrier transit-limited bandwidth	72 GHz	[65]
$R_{l}$	Load resistance	50 Ω
$R_{s}$	Series resistance	19 Ω	[65]
$C_{p}$	Parasitic capacitance	0.13 pF	[65]
$C_{j} / S$	Junction capacitance versus detector area	3.75 × 10⁻⁴ pF/µm²	[65]

Parameter	Description	Value
$f_{r}$	Update/sampling rate of DACs/ADCs	1 Gsps
$P_{DAC - h}$	Power consumption of a DAC operating at $f_{r}$	0.4 W
$P_{ADC}$	Power consumption of an ADC operating at $f_{r}$	0.96 W
$Δ P_{DAC} / Δ f_{r}$	Slope of $P_{DAC}$ versus $f_{r}$	0.16 W/Gsps
$P_{ph}$	Required optical power per PD	1 mW

Optics Express