
RGAIA: a reconfigurable AWGR based optical data center network


Abstract

Benefitting from cost-effective and flexible interconnection between computing nodes and storage infrastructure, a wide variety of applications and services are deployed in data centers (DCs). These traffic-intensive applications put tremendous pressure on current electrically switched DC networks (DCNs), which suffer from a bandwidth bottleneck. Thanks to its data-rate and format transparency, an optically switched DCN with intrinsically high bandwidth is a promising solution to replace hierarchical electrical DCNs with bandwidth limitations. Moreover, the applications deployed in DCNs, with their mixed traffic characteristics, require dynamic quality of service (QoS) provisioning. Optical DCNs therefore need a flexible topology with bandwidth reconfigurability to adapt to traffic variety. In this paper, we propose and experimentally investigate a reconfigurable optical packet switching DCN named RGAIA, based on flexible top-of-rack switches (ToRs) and fast optical switching, where the optical switch is implemented by tunable transceivers combined with arrayed waveguide grating routers (AWGRs). Under the management of the designed software defined networking (SDN) control plane, RGAIA dynamically distributes the wavelength resources and reconfigures the bandwidth in real time based on the monitored traffic characteristics. Experimental assessments validate that RGAIA improves latency and packet loss by 37% and 66%, respectively, compared with a network with rigid interconnections at a traffic load of 0.8.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

The rapid development of data-rich applications, such as cloud computing, the Internet of things and upcoming 5G-related services, is pushing data centers (DCs) toward ever higher performance [1]. To handle the enormous traffic growth within DCs, the data center network (DCN) is required to deliver high bandwidth and throughput together with low latency and packet loss [2–4]. Considering the scale and dimension of current DCNs, the switching layer must support tens of Tb/s even for oversubscribed topologies [5]. However, electrical switches are encountering a bandwidth bottleneck, mainly resulting from the scaling issues of ball grid array (BGA) packaging. The conventional approach of stacking ASIC boards can improve the bandwidth, but such multi-tier electrical switches introduce extra interference and latency, leading to performance deterioration and high routing complexity [6,7]. Moreover, conventional electrical DCNs process data in the electrical domain and forward it in the optical domain, leading to the massive deployment of power-hungry O-E-O converters at the switches [8]. Benefitting from transparency to data format and rate, switching packets directly in the optical domain with high bandwidth has been considered a feasible solution to the aforementioned challenges. The ultra-high optical bandwidth avoids multi-tier architectures and thereby improves network performance, while the elimination of O-E-O conversions dramatically decreases both the power consumption and the switching delay.

An arrayed waveguide grating router (AWGR) combined with tunable transceivers can work as a fast optical switch with multiple input and output ports. Without mechanical or electrical parts, the passive nature of the AWGR provides robust operation. Benefitting from these advantages, several AWGR-based scalable optical switches have been proposed [9–11]. Moreover, the SIRIUS architecture, which harnesses fast wavelength tuning and stationary AWGR forwarding, provides a platform with high bandwidth and flat connectivity [12]. Nevertheless, the static link capacity in SIRIUS is far from optimal for practical scenarios, especially when dealing with modern traffic flows with diverse characteristics. A survey of Microsoft data centers shows that only 60% of links are utilized in daily operation while the others are idle. Moreover, links may be insufficient to handle traffic bursts caused by hardware off-loading, leading to extra packet loss. Static physical interconnection and uniform bandwidth allocation, as in SIRIUS, cannot fully utilize the precious bandwidth and are unable to provide flexible optical links adapted to the traffic variety. One approach is to allocate network-bound service components to physical devices with high-bandwidth connectivity through intelligent workload placement algorithms [13]. However, running workload placement algorithms on every physical node dramatically complicates control and management, especially in large-scale deployments.

Considering complexity and scalability, another method is to dynamically reconfigure optical links based on the interaction between data forwarding and a collaborative control protocol. This approach requires the DCN to be designed in a “shape-shift” fashion. Several reconfigurable optical DCNs have been proposed, such as OWCell, OSA and ROTOS [14–16], providing both topologies and protocols to elastically optimize the bandwidth distribution at each top of rack (ToR). However, their respective defects hinder practical deployment. OWCell deploys wireless channels between nodes; the requirement of line-of-sight connections greatly restricts the implementation, where tiny errors in aligning a TX-RX pair can fatally degrade performance. The optical switching matrix harnessed in the wired OSA builds temporary links between nodes, but the limited port radix restricts the overall scalability to at most 2560 servers. ROTOS deploys SOA-based switches interacting with wavelength selective switches (WSSs) to provide fast packet switching and flexible bandwidth provisioning. However, an Optical Flow Control (OFC) protocol combined with electrical buffers (RAM) is employed to solve packet contention, which introduces extra retransmission delay, especially under multi-contention scenarios [17].

With fast wavelength tuning, the AWGR provides full connectivity from input ports to output ports, ensuring flexibility in the network topology. Considering the skewed service requirements in realistic DCs, in this paper, building on our previous prototype in [18], we propose and experimentally investigate a novel reconfigurable AWGR-based optical DCN named RGAIA. The interconnection is supported by the designed flexible optical ToR with tunable transceivers, and distributed AWGRs together with the tunable transceivers form the switching layer. The software defined networking (SDN) control plane monitors real-time traffic information and accordingly changes the wavelength selection and link bandwidth to provide flexible optical interconnection. The experimental assessment shows that RGAIA achieves a satisfactory improvement after reconfiguration, with a 37% decrease in latency and a 65.6% drop in packet loss compared with the performance under a static topology. We also investigate the scalability and reconfigurability of a large-scale implementation using the OMNeT++ simulation tool. The results show that the proposed RGAIA provides steady performance across various configurations, including different ToR buffer sizes, traffic patterns and aggregation lengths. The numerical assessment also shows that real-time bandwidth reconfiguration brings a dramatic improvement to RGAIA even in a large-scale network.

This paper is organized as follows: Section 2 presents the overall architecture, the ToR schematic, the AWGR routing and the scheduling principle. Section 3 describes the experimental setup and the corresponding results. Section 4 discusses the simulation setup and numerical results. Finally, Section 5 concludes the paper.

2. RGAIA: Interconnection and working principle

2.1 Overall structure

The overall architecture of the reconfigurable RGAIA is illustrated in Fig. 1. Based on our previous work on GAIA [18], we employ a static scheduling sequence instead of a traffic-oriented sequence, ensuring that the communication link between each TX-RX pair can be built in time. Moreover, flexible functional blocks in the DCN topology enable elastic bandwidth provisioning: fast wavelength tuning combined with SDN agents is deployed to adapt to traffic variety. The whole structure is designed in a cluster-independent style, with N clusters in total, where each cluster contains N racks and each rack groups H servers. The Ethernet frames generated by the servers are classified into three types: Intra-rack traffic, Intra-cluster traffic and Inter-cluster traffic. Intra-rack traffic stays within the same rack and is directly forwarded through the Ethernet switch embedded in the corresponding ToR. Traffic destined to other racks within the same cluster is regarded as Intra-cluster traffic and is forwarded through the Intra-AWGR of that cluster. By connecting the Inter-AWGR with the $m$-th rack in each cluster, cross-cluster communication links are built to deliver Inter-cluster traffic. Thus, the aforementioned flat topology features one-hop links for Intra-cluster traffic and at most two-hop links for Inter-cluster traffic. Additionally, thanks to the extensive path diversity, multiple interconnection links exist between each pair of ToRs, which increases the overall fault tolerance and provides a platform for further load-balancing algorithms.
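As a minimal illustration of this three-way classification, the sketch below assumes a server address encoded as (cluster, rack, server) indices; the field names and the `classify` helper are illustrative assumptions rather than the actual ToR implementation.

```python
from dataclasses import dataclass
from enum import Enum

class TrafficClass(Enum):
    INTRA_RACK = "intra-rack"        # forwarded by the ToR's embedded Ethernet switch
    INTRA_CLUSTER = "intra-cluster"  # one hop through the cluster's Intra-AWGR
    INTER_CLUSTER = "inter-cluster"  # up to two hops through an Inter-AWGR

@dataclass(frozen=True)
class ServerAddr:          # illustrative address layout, not the actual frame format
    cluster: int           # 0 .. N-1
    rack: int              # 0 .. N-1 within the cluster
    server: int            # 0 .. H-1 within the rack

def classify(src: ServerAddr, dst: ServerAddr) -> TrafficClass:
    """Classify a frame into one of RGAIA's three traffic types."""
    if src.cluster == dst.cluster and src.rack == dst.rack:
        return TrafficClass.INTRA_RACK
    if src.cluster == dst.cluster:
        return TrafficClass.INTRA_CLUSTER
    return TrafficClass.INTER_CLUSTER

# classify(ServerAddr(0, 1, 3), ServerAddr(0, 2, 0)) -> TrafficClass.INTRA_CLUSTER
```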

Fig. 1. Overall architecture of RGAIA.

2.2 Schematics of ToR

Figure 2 illustrates the schematic details of the flexible ToR. Ethernet frames with various destinations are first processed by the Ethernet switch, which maps MAC addresses in a look-up table. Traffic classified as Intra-rack traffic is directly forwarded to the destination server. Inter-rack traffic is further processed in the network interface card (NIC). The NIC is divided into an Intra-NIC and an Inter-NIC, which process the Intra-cluster traffic and the Inter-cluster traffic, respectively. Before being grouped into optical packets, the electrical frames are first stored in RAMs according to their destinations. For Intra-cluster traffic, frames destined for the same rack are stored in the same buffer, while for Inter-cluster traffic, packets heading to the same cluster share the same buffer. Each block aggregating $n$ buffers is directly connected to one tunable transmitter. Thus, the numbers of buffers and blocks must satisfy the following equation:

$$n \times p = N$$
where $n$ is the number of buffers in one block, $p$ is the number of blocks in each NIC, and $N$ is the number of racks in one cluster. The Inter-NIC follows the same interconnection scheme as the Intra-NIC.
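To make Eq. (1) concrete, the following sketch maps an Intra-cluster destination rack to a (block, buffer) pair under the assumption of a contiguous partition of racks over the $p$ tunable transmitters; the partition order is an illustrative assumption, since it is not specified above.

```python
def buffer_location(dst_rack: int, num_racks: int, num_blocks: int) -> tuple[int, int]:
    """Map an Intra-cluster destination rack to a (block, buffer) pair.

    num_racks  = N, racks per cluster
    num_blocks = p, tunable transmitters (buffer blocks) per NIC
    Each block aggregates n = N / p destination buffers, i.e. Eq. (1): n * p = N.
    The contiguous rack-to-block partition used here is an assumption for illustration.
    """
    if num_racks % num_blocks != 0:
        raise ValueError("N must be an integer multiple of p")
    n = num_racks // num_blocks       # buffers per block
    return dst_rack // n, dst_rack % n

# e.g. N = 4 racks per cluster and p = 2 transmitters give n = 2 buffers per block:
# buffer_location(3, 4, 2) == (1, 1), i.e. rack 3 is served by transmitter 1, buffer 1.
```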

Fig. 2. Design details of functional blocks in each flexible ToR: TX = Transmitter; RX = Receiver.

2.3 AWGR based switching layer and scheduling principle

The tunable transmitters interacting with the AWGRs form the optical switching layer. Whereas traditional tunable lasers require milliseconds for wavelength tuning, SOA-based tunable lasers can realize nanosecond tuning to guarantee the overall throughput [12]. As illustrated in Fig. 3(a), the SOA-based gate array selects the desired wavelength: supplying drive current allows the corresponding wavelength to pass while the others are blocked.

Fig. 3. (a) SOA based tunable laser; (b) AWGR routing for small-scale prototype.

Additionally, the SOA gates can compensate for the power loss incurred during signal transmission.

The switching functionality of RGAIA is supported by the AWGR, in which different wavelengths arriving at one input port are diffracted cyclically to all output ports. Each tunable transmitter forwards optical traffic to its connected input port, and the traffic is guided to the destination receiver according to the waveguide routing.
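The cyclic routing property can be captured by the textbook rule that wavelength index $k$ launched into input port $i$ of a $P$-port AWGR exits output port $(i+k) \bmod P$. The sketch below uses this standard rule for illustration only; the exact port/wavelength mapping of the device in Fig. 3(b) may differ by a fixed offset or direction.

```python
def awgr_output_port(input_port: int, wavelength_idx: int, port_count: int) -> int:
    """Cyclic AWGR routing rule: wavelength k on input i leaves output (i + k) mod P."""
    return (input_port + wavelength_idx) % port_count

# A transmitter on input port 2 of an 8x8 AWGR tuned to wavelength index 5 reaches
# the receiver attached to output port (2 + 5) % 8 == 7.
```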

In Fig. 1, the dimension of RGAIA is ${N^2} \times H$ servers, and the value of $N$ is mainly restricted by the port radix of the AWGR. The maximum interconnection scale of the switching layer is described by the following equation:

$$M \times P \times \lambda = {N^2}$$
where $M$ is the number of deployed AWGRs, $P$ is the port radix of each AWGR, $\lambda$ is the number of employed wavelengths, and ${N^2}$ is the maximum number of racks that can be interconnected. Considering that AWGRs with hundreds of ports have been experimentally demonstrated, RGAIA can be extended to support tens of thousands of servers [19]. Moreover, the solution of cascading AWGRs in [20] can ameliorate the scalability issue and thus further expand the dimension of AWGR-based optical switching networks.

Given the passive nature of the AWGR, a cooperative scheduling mechanism is essential for orderly and efficient packet transmission. Matching the routing design of the AWGR, time division multiplexing (TDM), which divides the transmission time into slots, is used to ensure that the communication link between each TX-RX pair can be established in time with the limited transmission resources. Also, because the receivers are broadband, each receiver can only process a single-wavelength signal in every slot; otherwise, a multi-wavelength signal would cause contention at the receiver, leading to extra packet loss. Moreover, when traffic unpredictably soars on active links, the slot durations need to be adjusted accordingly to optimize the bandwidth utilization under the new traffic distribution.

The SDN control plane is composed of OpenDaylight (ODL) and OpenStack platforms. The FPGA-implemented ToR collects the traffic characteristics in each time slot and sends this information to the ODL platform. The computation blocks in the OpenStack platform deliver control signals to the data plane to realize real-time updates of the link capacity, which keeps the ratio of the durations of the temporary links equal to the ratio of the traffic volumes they carry. The interaction between the data plane and the control plane is handled by OpenFlow agents embedded in ODL, leading to time-varying packet processing in the ToRs. By monitoring the buffer occupancy and traffic volume in the ToRs, the SDN agents can optimize the slot durations while transmission is ongoing. For example, both slots in Table 1 are initialized to the same duration to cover general traffic cases. When a traffic burst suddenly occurs on the active links of the first slot, the SDN agents extend the first slot to deal with the congestion, and the second slot becomes correspondingly shorter to keep the loop time (2 slots) fixed.
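The control-plane rule above, i.e., keeping the slot-duration ratio equal to the monitored traffic-volume ratio while the loop time stays fixed, can be summarized by the short sketch below; it is a simplified model of the decision made in the OpenStack computing blocks and ignores quantization to the FPGA clock.

```python
def reassign_slot_durations(traffic_volumes: list[float], loop_ns: float) -> list[float]:
    """Split a fixed TDM loop into slots proportional to the monitored traffic volumes."""
    total = sum(traffic_volumes)
    if total == 0:
        # No traffic observed in the last loop: fall back to a uniform split.
        return [loop_ns / len(traffic_volumes)] * len(traffic_volumes)
    return [loop_ns * v / total for v in traffic_volumes]

# With the 200 ns loop of the experiment and a 3:2 traffic ratio, the two slots
# become 120 ns and 80 ns, matching Section 3.2; ratios of 3:1 and 4:1 would give
# 150/50 ns and 160/40 ns under the same rule.
print(reassign_slot_durations([3, 2], 200.0))   # [120.0, 80.0]
```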


Table 1. Routing table for an 8×8 AWGR connecting 4 racks, where each rack contains two transceivers and each transceiver uses a 2-wavelength laser bank.

For simplicity, we now use a small-scale prototype with only 16 racks to explain the scheduling principle, including contention avoidance, slot division, wavelength selection and bandwidth redistribution. The depicted prototype groups 4 racks in each cluster. In the Intra-NIC, 2 tunable transmitters are deployed, each equipped with a two-wavelength laser bank. For the switching layer, we use one 8×8 AWGR, with the routing details illustrated in Fig. 3(b).

Referring to the routing design in Fig. 3(b), the TDM operation is realized via wavelength switching in each transmitter: by controlling the drive-current activation, the SDN agents assign a predefined wavelength to the corresponding buffers when a slot begins, and this wavelength assignment holds until the slot ends. Each new wavelength thus creates a new temporary link to the corresponding destination through the AWGR. Taking the routing design and contention avoidance into account, one example of a scheduling table for our prototype is given in Table 1.

3. Experimental evaluation

Benefitting from the flexible topology and its elastic components, the FPGA-implemented ToR, together with the SDN control plane, can dynamically optimize the bandwidth distribution of the optical links to the Intra-AWGR and Inter-AWGR. In this section, the overall network performance, including average latency, total packet loss, and flexible bandwidth allocation, is experimentally assessed by connecting discrete physical components.

3.1 Experimental setup

The testbed shown in Fig. 4 is built to experimentally investigate the performance of RGAIA. Four FPGA-implemented (Xilinx VC709) optical ToRs are deployed, each equipped with two tunable transceivers with a two-wavelength laser bank. Each laser bank connects to two SOA gates and their drivers. The FPGA-based ToR tunes the wavelength by controlling the SOA gates and accordingly selects the packet to be carried on the tuned wavelength. The Spirent Ethernet test center is used to emulate real-world servers, generating Ethernet frames with variable destinations under dynamic load at 10 Gb/s; the network performance in terms of packet loss and end-to-end packet latency is also measured at the Spirent. The OpenDaylight and OpenStack platforms collect information from each FPGA-ToR and assign the wavelength configuration through OpenFlow (OF) links, where OF agents act as bi-directional translators between the PCIe protocol and the SDN messages. Each FPGA-based ToR chooses the carrier wavelength by controlling the SOA gates and transmits the frames through an MZI-based modulator. One 32×32 AWGR, of which 8 ports are utilized, interconnects the ToRs via optical channels at 40 Gb/s. After the photodiode, the electrical packets are sent to the FPGA-based ToR for further processing. Over the whole transmission path, the signal experiences 3 dB and 5 dB of loss at the coupler and the AWGR, respectively. Considering the receiver sensitivity of -20 dBm and the amplification provided by the SOA, the optical loss budget is sufficient for this experimental configuration.
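A simple dB-domain budget check for this setup is sketched below. The 3 dB coupler loss, 5 dB AWGR loss and -20 dBm receiver sensitivity come from the description above, whereas the launch power and net SOA gain are purely hypothetical placeholders, since they are not reported here.

```python
def received_power_dbm(launch_dbm: float, soa_gain_db: float, losses_db: list[float]) -> float:
    """dB-domain link budget: add the SOA gain, subtract every loss along the path."""
    return launch_dbm + soa_gain_db - sum(losses_db)

# Hypothetical launch power (0 dBm) and net SOA gain (6 dB); the 3 dB coupler loss,
# 5 dB AWGR loss and -20 dBm sensitivity are the values quoted in the setup.
SENSITIVITY_DBM = -20.0
margin_db = received_power_dbm(0.0, 6.0, [3.0, 5.0]) - SENSITIVITY_DBM   # 18 dB margin
```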

Fig. 4. Experimental set-up for RGAIA.

3.2 Experimental result

A bit-error-rate (BER) analyzer is first used to quantify the signal degradation after routing through the AWGR, with the back-to-back (B2B) result measured as the reference. We test the performance with a single-wavelength signal and with a WDM signal, respectively. As shown in Fig. 5(a), the AWGR introduces <0.2 dB power penalty for the single-wavelength signal and <0.5 dB power penalty for error-free operation (BER of 1E-9) of the WDM channels. The eye diagrams further confirm that the distortion introduced by the AWGR is acceptable. Additionally, the experimental performance before and after reconfiguration, including the average server-to-server latency and the packet loss, is illustrated in Fig. 5(b). The switching principle follows the scheduling logic in Table 1, with both slots initialized to a duration of 100 ns. The Spirent Ethernet test center generates frames such that the ratio of traffic volume between the two slots is 3:2. After the corresponding information is collected by the Ethernet switch, $To{R_1}$ sends the current traffic characteristics to the SDN control plane through the OF links. The computing engine in the OpenStack platform sends the control signal back to $To{R_1}$ and sets the durations of the two slots to 120 ns and 80 ns, respectively. Thus, with the help of the SDN and without any manual intervention, the FPGA-implemented ToR automatically adapts the slot durations to optimize the wavelength distribution. With the optimized slot assignment, the latency decreases noticeably at high load, reaching a 37% improvement at a load of 0.8. Moreover, as the load increases to 0.5, packets start to be lost before reconfiguration, whereas with the optimized slot durations the packet loss remains zero up to a load of 0.7 and drops by 65.6% at a load of 0.8. Benefitting from the intelligent bandwidth distribution of the SDN, links under higher pressure are given more bandwidth to relieve the congestion, leading to the improvement in overall performance.

Fig. 5. (a): BER curves and eye diagram for AWGR based optical links; (b): Performance improvement (latency and packet loss) introduced by optimized slot duration for experimental test.

4. Numerical assessment

The OMNeT++ simulation framework is used to fully investigate the network performance of RGAIA for large-scale interconnections.

4.1 Traffic generation

The ongoing applications and services strongly determine the traffic characteristics in DCs. The traffic patterns used in the evaluation are generalized from real-world data centers, as illustrated in Table 2 [21]. Frames generated by different servers choose their destinations randomly. All servers generate frames at loads from 0.1 to 1, where the load is the ratio between the occupied bandwidth and the link capacity of each server. The generated frame length ranges from 64 bytes to 1500 bytes to follow realistic statistics [22]; Fig. 6(a) and Fig. 6(b) depict the cumulative distribution function (CDF) and histogram, respectively. The frame length distribution follows a bimodal model: more than 80% of packets are either short (< 200 bytes) or long (> 1400 bytes). The short packets normally carry control signals, while the long packets are the payload of each data exchange.
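A toy generator reproducing this bimodal shape is sketched below; the 45/45/10 mixture weights are illustrative assumptions chosen only so that more than 80% of frames fall in the short or long mode, not values taken from the traces in [21,22].

```python
import random

def sample_frame_length(rng: random.Random) -> int:
    """Draw a frame length in bytes from a simple bimodal mixture (illustrative weights)."""
    u = rng.random()
    if u < 0.45:                    # short, control-style frames
        return rng.randint(64, 199)
    if u < 0.90:                    # long payload frames
        return rng.randint(1401, 1500)
    return rng.randint(200, 1400)   # remaining mid-sized frames

rng = random.Random(1)
lengths = [sample_frame_length(rng) for _ in range(10_000)]
bimodal_share = sum(l < 200 or l > 1400 for l in lengths) / len(lengths)   # about 0.9 here
```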

Fig. 6. (a) CDF of frame length; (b) Histogram of frame length.


Table 2. Real-world Traffic Patterns

A period of continuous frames with the same destination is defined as a traffic flow. According to [23–25], the traffic flow is modeled by ON/OFF periods whose lengths follow a Pareto distribution. The equation below gives the CDF of the ON period length:

$$ON_{CDF}(x) = 1 - \left(1 + \varepsilon\,\frac{x - x_m}{\theta}\right)^{-\frac{1}{\varepsilon}}, \quad x > x_m$$
where $\varepsilon$ is a shape parameter set to 0.9, $\theta$ is a scale parameter equal to 2746, and $x_m$ is the threshold, set to 64 bytes. With this configuration, our simulation guarantees that more than 80% of traffic flows are short (< 10 KB), which matches the characteristics observed in the real world.
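Equation (3) can be sampled by inverting the CDF, $x = x_m + (\theta/\varepsilon)\left((1-u)^{-\varepsilon} - 1\right)$ with $u$ uniform on $[0,1)$. The sketch below applies this with the parameters listed above and checks the share of short flows.

```python
import random

def sample_on_period(rng: random.Random, eps: float = 0.9,
                     theta: float = 2746.0, x_m: float = 64.0) -> float:
    """Sample an ON-period length (bytes) by inverting the CDF in Eq. (3)."""
    u = rng.random()
    return x_m + (theta / eps) * ((1.0 - u) ** (-eps) - 1.0)

rng = random.Random(42)
flows = [sample_on_period(rng) for _ in range(10_000)]
short_share = sum(f < 10_000 for f in flows) / len(flows)   # roughly 0.8, cf. text above
```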

Figure 7 illustrates the CDFs of the ON and OFF periods. The ON period length strictly follows the predefined distribution and is essentially independent of the traffic load, whereas the flow gap time (OFF period) exhibits a linear relationship with the load.

Fig. 7. (a) CDF for ON period length; (b) CDF for OFF period length.

4.2 Parameters setup

The clock and data recovery (CDR) time is set to 3.1 ns according to the novel clock distribution technique in [26]. Each cell (buffer unit) is 64 bytes; when grouping optical packets for Intra-cluster and Inter-cluster traffic, the transmitters process these 64-byte cells instead of long data packets. Thus, the optical packet length can only be an integer multiple of 64 bytes, e.g., a 1000-byte packet is aggregated into a 16-cell optical packet. At each RAM block, the buffering time for every cell is 80 ns, and the processing duration of each cell is 51.2 ns (64 bytes ${\times}$ 8 / 10 Gb/s); these durations are independent of the network scale according to [27]. The initial distance between the ToRs and the switching layer is 50 meters, and the transmission latency plus other necessary configuration time is called the round trip time (RTT). Based on our previous experimental results, the RTT normally fluctuates around 370 ns. Before exploring the performance of RGAIA at large scale, the experimental configuration of Section 3 is first assessed with the simulation model. As depicted in Fig. 8, the simulation result shows a trend similar to the experimental result, which in turn verifies the validity of our simulation model.
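The cell-level bookkeeping described above reduces to two small calculations, reproduced below: the number of 64-byte cells per frame (e.g., 16 cells for a 1000-byte packet) and the 51.2 ns serialization time of one cell at 10 Gb/s.

```python
import math

CELL_BYTES = 64
LINE_RATE_GBPS = 10.0

def cells_needed(frame_bytes: int) -> int:
    """Optical packets span an integer number of 64-byte cells (1000 B -> 16 cells)."""
    return math.ceil(frame_bytes / CELL_BYTES)

def cell_processing_ns(cell_bytes: int = CELL_BYTES, rate_gbps: float = LINE_RATE_GBPS) -> float:
    """Serialization time of one cell: 64 B x 8 bit / 10 Gb/s = 51.2 ns."""
    return cell_bytes * 8 / rate_gbps   # bits divided by Gb/s yields nanoseconds

assert cells_needed(1000) == 16
assert abs(cell_processing_ns() - 51.2) < 1e-9
```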

Fig. 8. Numerical assessment of experimental configuration.

For the large-scale simulations, RGAIA has a constant scale: 64 ToRs grouping 2560 servers in total, with 8 transceivers per ToR (4 for the Intra-NIC and 4 for the Inter-NIC), each offering two wavelength selections. Both the Intra-AWGRs and the Inter-AWGRs have a scale of 32×32. We set up several simulation configurations to test the performance at large scale. These configurations assess the scalability of RGAIA, in terms of latency, packet loss and throughput, with respect to various influencing factors: buffer sizes, traffic patterns and aggregation lengths. Moreover, we also evaluate the performance for different traffic volumes and the corresponding slot durations to fully assess the reconfiguration efficiency at large scale.

4.3 Dynamic NIC-buffer dimensions

First, the impact of the buffer size in the NICs is assessed. The length of the generated frames follows the distribution in Fig. 6. During an ON period the server generates frames continually, while no packets are generated during an OFF period. The ON and OFF period lengths match the CDFs in Fig. 7(a) and Fig. 7(b), respectively.

Figure 9(a) and Fig. 9(b) show the server-to-server latency and the ToR-to-ToR latency with the buffer dimension ranging from 3 KB to 10 KB. Since no retransmission algorithm is adopted in our simulation model, only the packets that have been successfully transmitted are counted when calculating the latency. As shown in Fig. 9(a), the average server-to-server latency does not exceed 1.5 μs when the load is below 0.7. As the load increases to 0.8, the average latency rises rapidly and the influence of the buffer dimension becomes dominant: 6 μs latency in the 3 KB case at a load of 1, but 9.6 μs with a 10 KB buffer at the same load. This is because congestion becomes the common case under high load and adds extra queuing latency to the RTT of each transmission. Moreover, with a larger buffer, more packets are stuck at the head of the buffers, worsening the queuing time and adding latency to the packet transmission. Figure 9(b) shows the average transmission latency over the optical links; as with the server-to-server latency, the ToR-to-ToR latency tends to increase with larger buffers, especially under high load. Figure 9(c) shows the overall packet loss. The conclusion drawn from the packet loss is the exact opposite of that for the latency: with a larger buffer, the NIC can store more packets, leading to a dramatic decrease in packet loss.

Fig. 9. Numerical assessment for RGAIA under different ToR buffer dimensions (3 KB; 5 KB; 7 KB; 10 KB): (a) Average server-to-server latency; (b) Average ToR-to-ToR latency; (c) Total packet loss.

Generally speaking, RGAIA provides robust performance for various ToR buffer sizes: the latency does not exceed 10 μs and the packet loss is less than 5% for buffers ranging from 3 KB to 10 KB. Additionally, at a load of 0.9, RGAIA features a latency of 3.6 μs and a packet loss of 0.6% when deploying the 5 KB buffer.

4.4 Various traffic patterns

Secondly, we assess the elasticity of RGAIA by applying different traffic patterns. Table 2 lists the detailed volume distribution of all tested patterns. In this simulation, each server buffer is 1.25 KB and the ToR buffer is set to 5 KB. The length of the generated frames again matches Fig. 6, and the ON/OFF period lengths follow the distributions in Fig. 7(a) and Fig. 7(b), respectively.

As shown in Fig. 10(a) and (b), RGAIA achieves robust performance under Pattern 1 but performs worst under Pattern 4. This is because Pattern 1 does not contain any cross-cluster traffic, while Pattern 4 includes the largest share of Inter-cluster traffic. Inter-cluster traffic, which sometimes occupies two-hop active links, piles up packets in both the Intra-NIC and the Inter-NIC, leading to extra packet discards and worse queuing times at the same time. Without Inter-cluster traffic, RGAIA achieves 1.8 μs latency and 0.35% packet loss at a load of 0.8. RGAIA also achieves satisfactory performance under Pattern 4: 4.5 μs latency and 4.07% packet loss at a load of 0.8, and 14.7 μs latency and 11.6% packet loss under the full-load configuration. Additionally, considering that Pattern 3 is the most common traffic distribution in real-world data centers, with 3.03 μs average latency and 2.52% total packet loss at a load of 0.8, RGAIA can be said to provide robust service for prevailing applications. Figure 10(c) reports the normalized network throughput as a function of the traffic load for the different traffic patterns. The throughput starts to saturate at a load of 0.7. In general, the performance does not fluctuate much across the traffic patterns, reflecting the good elasticity of RGAIA in dynamic traffic scenarios.

Fig. 10. Simulation result of RGAIA under dynamic traffic patterns: pattern 1 (50% Intra-rack traffic, 50% Intra-cluster traffic), pattern 2 (62.5% Intra-rack traffic, 25% Intra-cluster traffic, 12.5% Inter-cluster traffic), pattern 3 (50% Intra-rack traffic, 37.5% Intra-cluster traffic, 12.5% Inter-cluster traffic), pattern 4 (50% Intra-rack traffic, 25% Intra-cluster traffic, 25% Inter-cluster traffic): (a) Average server-to-server latency; (b) Total packet loss; (c) Normalized throughput.

4.5 Different packet aggregation length

In this section, the performance is explored for transceivers aggregating 7-cell, 14-cell and 24-cell optical packets in the NICs, respectively. Every simulation lasts 1 ms and deploys 5 KB ToR buffers.

Figure 11(a) and Fig. 11(f) show the ToR-to-ToR latency and the packet loss in the NIC under the predefined configurations. At the ToR switching layer, performance improves with shorter aggregated packets, whereas the conclusion changes when the Ethernet switching is taken into account. Figure 11(b) and (d) show the server-to-server latency and the total packet loss, where the 7-cell scenario performs better when the load is below 0.7; however, both the latency and the packet loss worsen dramatically as the load continues to increase. The packet loss in the Ethernet switch, illustrated in Fig. 11(e), explains this behavior. Under high load, more of the shorter packets can be aggregated in the NICs than of the longer packets within the same simulation duration, as illustrated in Fig. 11(c). This puts heavier pressure on the Ethernet switches. Also, the shorter optical packets result in more successful transmissions at the optical switching layer, as illustrated in Fig. 11(f), which further worsens the congestion in the Ethernet switches and finally results in the performance degradation.

Fig. 11. Numerical assessment for RGAIA under different aggregation length (7 cells; 14 cells; 24 cells) in NICs: (a) Average ToR-to-ToR latency; (b) Average server-to-server-latency; (c) Number of packets sent during 1-ms simulation; (d) General packet loss; (e) Packet loss in Ethernet switch; (f) Packet loss in NIC.

4.6 Reconfiguration performance of large-scale network

Fourthly, we investigate the performance of RGAIA when encountering traffic bursts in the large-scale implementation. Table 3 lists the different traffic volumes in each time slot and the corresponding slot durations. In the simulation, we assume each transceiver uses two wavelengths, and thus within at most one loop (two slots) each ToR has established a temporary connection link with every other ToR.


Table 3. Cases for Evaluated Slot Duration and Traffic Distribution

The predefined duration of one loop is 200 ns. For general traffic generation, the probability of packets going to any ToR is the same; thus, the two slots should last equally, 100 ns each. However, when hardware off-loading happens persistently in some slots, the traffic volume on some active links increases dramatically, leading to the different traffic volume ratios in Table 3, and the equal slot distribution becomes irrational. Benefitting from the flexible design, the bandwidth distribution can be optimized accordingly. Figure 12 illustrates the performance before and after the duration redistribution. The first scenario is case 1 (Fig. 12(a)), where the traffic ratio between the two slots is 3:2 and the performance is tested under the initial slot durations (100 ns : 100 ns) and the reconfigured durations (120 ns : 80 ns). For the latency, the difference is hard to distinguish at low load; however, as the load increases, the gap becomes increasingly obvious: a 28.6% improvement at a load of 0.8, rising to 32.7% at full load. The original configuration reaches 3.4% packet loss at a load of 0.7, whereas after reconfiguration the packet loss at 0.7 load is only 0.1%, and the improvement reaches 36.73% at a load of 1. As the traffic ratio deviates further from 3:2 to 3:1, the improvement follows a similar trend, as illustrated in Fig. 12(b). Under the 3:1 configuration, the latency drops by 29.1% after reconfiguration at a load of 0.8, and the packet loss decreases by 58.76% at the same load; at a load of 1, the optimized slot durations yield 44.18% and 37% decreases in latency and packet loss, respectively. For the 4:1 configuration, the flexible bandwidth distribution introduces 32.2% and 38.4% improvements in latency and packet loss, respectively, at a load of 0.8, as depicted in Fig. 12(c).

Fig. 12. Performance improvement in latency and packet loss for: (a) case 1; (b) case 2; (c) case 3.

5. Conclusion

We have proposed and experimentally assessed RGAIA, a reconfigurable and flat optical data center network. Benefitting from the real-time interaction between the flexible optical ToRs and the SDN agents, RGAIA can intelligently reassign bandwidth among the active links based on the traffic distribution, thus providing robust performance for scenarios with complex application deployments. The experimental investigation shows that the reconfigurable bandwidth distribution improves latency by 37% and packet loss by 65.6% compared with rigid network interconnections. Additionally, the OMNeT++ based simulation model verifies the dramatic performance improvement introduced by bandwidth reconfiguration even in a large-scale implementation: with 2560 servers, RGAIA improves latency by at least 28.6% and packet loss by at least 29.2% compared with the performance under static slot durations.

Funding

National Natural Science Foundation of China (62125103, 62171059, 62101065); State Key Laboratory of Advanced Optical Communication Systems and Networks; State Key Laboratory of Information Photonics and Optical Communications (IPOC2021ZT08).

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. C. G. C. Index, “Forecast and methodology, 2016–2021 white paper,” (2018). [Online] Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.html.

2. A. Mohammad, A. Kabbani, T. Edsall, B. Prabhakar, A. Vahdat, and M. Yasuda, “Less is more: Trading a little bandwidth for ultra-low latency in the data center,” in 9th USENIX Conf. on Networked Systems Design and Implementation, San Jose, CA, 2012.

3. T. Z. Emara and J. Z. Huang, “A distributed data management system to support large-scale data analysis,” J. Syst. Softw. 148, 105–115 (2019). [CrossRef]  

4. K. Wu, J. Xiao, and L. M. Ni, “Rethinking the architecture design of data center networks,” Front. Comput. Sci. (2012).

5. J. Krzywda, A. Ali-Eldin, T. E. Carlson, P.O. Östberg, and E. Elmroth, “Power-performance tradeoffs in data center servers: DVFS, CPU pinning, horizontal, and vertical scaling,” Future Gener. Comp. Syst. 81, 114–128 (2018). [CrossRef]  

6. A. Ghiasi, “Large data centers interconnect bottlenecks,” Opt. Express 23(3), 2085 (2015). [CrossRef]  

7. A. Ghiasi, “Is there a need for on-chip photonic integration for large data warehouse switches,” The 9th International Conference on Group IV Photonics (GFP), 27–29 (2012).

8. K. Prifti, A. Gasser, N. Tessema, X. Xue, R. Stabile, and N. Calabretta, “System Performance Evaluation of a Nanoseconds Modular Photonic Integrated WDM WSS for Optical Data Center Networks,” (OSA, 2019), pp. 1–3.

9. K. Ueda, Y. Mori, H. Hasegawa, and K. Sato, “Large-Scale Optical Switch Utilizing Multistage Cyclic Arrayed-Waveguide Gratings for Intra-Datacenter Interconnection,” IEEE Photonics J. 9(1), 1–12 (2017). [CrossRef]  

10. R. Proietti, Y. Yin, R. Yu, C. J. Nitta, V. Akella, C. Mineo, and S. J. B. Yoo, “Scalable Optical Interconnect Architecture Using AWGR-Based TONAK LION Switch With Limited Number of Wavelengths,” J. Lightwave Technol. 31(24), 4087–4097 (2013). [CrossRef]  

11. K. Sato, H. Hasegawa, T. Niwa, and T. Watanabe, “A large-scale wavelength routing optical switch for data center networks,” IEEE Commun. Mag. 51(9), 46–52 (2013). [CrossRef]  

12. H. Ballani, P. Costa, R. Behrendt, D. Cletheroe, I. Haller, K. Jozwik, F. Karinou, S. Lange, K. Shi, B. Thomsen, and H. Williams, “Sirius: A Flat Datacenter Network with Nanosecond Optical Switching,” (ACM, 2020), pp. 782–797.

13. J. Hamilton, “Data center networks are in my way,” 2009 [Online]. Available: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_CleanSlateCTO2009.pdf

14. A. S. Hamza, S. Yadav, S. Ketan, J. S. Deogun, and D. R. Alexander, “OWCell: Optical wireless cellular data center network architecture,” in 2017 IEEE International Conference on Communications (ICC) (2017), pp. 1–6.

15. K. Chen, A. Singla, A. Singh, K. Ramachandran, L. Xu, Y. Zhang, X. Wen, and Y. Chen, “OSA: An Optical Switching Architecture for Data Center Networks With Unprecedented Flexibility,” IEEE/ACM Transactions on Networking 22(2), 498–511 (2014). [CrossRef]  

16. X. Xue, F. Yan, K. Prifti, F. Wang, B. Pan, X. Guo, S. Zhang, and N. Calabretta, “ROTOS: A Reconfigurable and Cost-Effective Architecture for High-Performance Optical Data Center Networks,” J. Lightwave Technol. 38(13), 3485–3494 (2020). [CrossRef]  

17. X. Xue and N. Calabretta, “Nanosecond optical switching and control system for data center networks,” Nat. Commun. 13, 2257 (2022). [CrossRef]  

18. J. Che, Z. Liu, and S. Wu, “GAIA: A Contention-free Optical Data Center Network Based on Arrayed Waveguide Grating Router,” in 26th Optoelectronics and Communications Conference, P. Alexander Wai, H. Tam, and C. Yu, eds., OSA Technical Digest (Optical Society of America, 2021), paper S4A.4.

19. S. Cheung, T. Su, K. Okamoto, and S. J. B. Yoo, “Ultra-Compact Silicon Photonic 512 × 512 25 GHz Arrayed Waveguide Grating Router,” IEEE J. Sel. Top. Quantum Electron. 20(4), 310–316 (2014). [CrossRef]  

20. C. Lea, “A Scalable AWGR-Based Optical Switch,” J. Lightwave Technol. 33(22), 4612–4621 (2015). [CrossRef]  

21. T. Benson, A. Anand, A. Akella, and M. Zhang, “Understanding data center traffic characteristics,” SIGCOMM Comput. Commun. Rev. 40(1), 92–99 (2010). [CrossRef]  

22. R. Sinha, C. Papadopoulos, and J. Heidemann, “Internet packet size distributions: Some observations,” USC/Inf. Sci. Inst., Tech. Rep. ISI-TR2007–643 (2007).

23. S. Kandula, S. Sengupta, A. Greenberg, A. Patel, and R. Chaiken, “The nature of datacenter traffic: Measurements and analysis,” in Proc. of 9th ACM SIGCOMM Conf. on Internet Measurement, 2009, pp. 202–208.

24. W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, “On the self-similar nature of Ethernet traffic (extended version),” IEEE/ACM Trans. Netw. 2(1), 1–15 (1994). [CrossRef]

25. L. Cottrell, W. Matthews, and C. Logg, “Tutorial on Internet Monitoring & PingER at SLAC,” 2000 [Online]. Available: http://www.slac.stanford.edu/comp/net/wan-mon/tutorial.html.

26. X. Xue and N. Calabretta, “Synergistic Switch Control Enabled Optical Data Center Networks,” IEEE Commun. Mag. 60(3), 62–67 (2022). [CrossRef]  

27. Cisco Systems, “Cut-through and store-and-forward Ethernet switching for low-latency environments,” white paper, 2008.
