FPGA-based neural network accelerators for millimeter-wave radio-over-fiber systems

Open Access

Abstract

With the rapid development of high-speed wireless communications, the 60 GHz millimeter-wave (mm-wave) frequency range has attracted extensive interest, and radio-over-fiber (RoF) systems have been widely investigated as a promising solution for delivering mm-wave signals. Neural networks have been proposed and studied to improve mm-wave RoF system performance at the receiver side by suppressing both linear and nonlinear impairments. However, previous studies of neural networks in mm-wave RoF systems have all relied on off-line processing with high-end GPUs or CPUs, which is not practical for applications requiring low power consumption, low cost and limited computation resources. To solve this issue, in this paper we investigate neural network hardware accelerators for mm-wave RoF systems implemented, for the first time, on a field programmable gate array (FPGA), taking advantage of the low power consumption, parallel computation, and reconfigurability of FPGAs. Both convolutional neural network (CNN) and binary convolutional neural network (BCNN) hardware accelerators are demonstrated. In addition, to satisfy the low-latency requirement of mm-wave RoF systems and to enable the use of low-cost compact FPGA devices, a novel inner parallel computation optimization method for implementing CNNs and BCNNs on FPGAs is proposed. Compared with the execution latency on a popular embedded processor (ARM Cortex A9), the proposed FPGA-based hardware accelerators reduce the processing delay in mm-wave RoF systems by about 99.45% and 92.79% for CNN and BCNN, respectively. Compared with non-optimized FPGA implementations, the proposed inner parallel computation method reduces the processing latency by about 44.93% and 45.85% for CNN and BCNN, respectively. In addition, compared with a GPU implementation, the latency of the CNN implementation with the proposed optimization method is reduced by 85.49%, while the power consumption is reduced by 86.91%. Although the latency of the BCNN implementation with the proposed optimization method is larger than that of the GPU implementation, the power consumption is reduced by 86.14%. The demonstrated FPGA-based neural network hardware accelerators provide a promising solution for mm-wave RoF systems.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

High-speed wireless communications are highly demanded by end-users to support various broadband applications, and 5G technology has been widely studied recently to satisfy this rapidly-growing demand. Due to the high-speed requirements, the use of the millimeter-wave (mm-wave) frequency range has attracted intensive interest. However, mm-wave signals suffer from high free-space propagation loss and typically require line-of-sight (LOS) links. To overcome these limitations, millimeter-wave radio-over-fiber (mm-wave RoF) systems have been widely studied by leveraging the advantages of optical fibers, such as low transmission loss and broad bandwidth. However, the signals in mm-wave RoF systems are impaired and distorted by a number of linear and nonlinear effects induced during signal modulation, amplification, transmission and detection [1], such as fiber chromatic dispersion and nonlinearities, phase noise, optical and electrical amplification, and the square-law detection of photo-detectors [2,3]. To overcome these limitations, various techniques have been studied, including both analog processing and digital signal processing techniques [4–7]. However, nonlinear effects are difficult to address with these signal processing schemes, and each processing step typically targets only one or a few impairments, resulting in relatively limited capability [6].

To solve these limitations, neural networks have been proposed and studied to improve the performance of mm-wave RoF systems [8]. Neural networks are widely used as equalizers or classifiers [9–11], and compared with conventional signal processing methods, they are capable of compensating various linear and nonlinear impairments simultaneously. Therefore, better performance has been achieved in mm-wave RoF systems with neural networks [12].

Although neural networks, such as the fully-connected neural network (FC-NN), the convolutional neural network (CNN), and the binary convolutional neural network (BCNN), have shown promising capabilities in improving the performance of mm-wave RoF systems, they have high computation cost and power consumption [13,14]. Previous studies have mainly focused on the development and adaptation of neural networks in optical communication systems, and GPUs or CPUs in high-performance computers have been used to implement the neural networks [15,16]. However, the use of such high-profile platforms is not practical in many real applications, such as in base stations, due to the high cost and high power consumption [14]. In addition, latency is a critical issue in mm-wave RoF systems, and the latency requirement will become even more stringent in future wireless communications. However, previous studies have not considered the additional latency introduced by neural networks in mm-wave RoF systems. Thus, power-efficient, low-latency, low-cost and practical hardware implementations of neural network signal processors (i.e., neural network hardware accelerators) in mm-wave RoF systems are highly demanded.

Neural network hardware accelerators have been studied in several applications, such as image recognition [17]. Both application-specific integrated circuit (ASIC) chips and field programmable gate arrays (FPGAs) have been studied for neural network hardware accelerators. ASIC chips have the advantages of high speed and low power consumption. However, the cost of designing ASIC chips is high, and ASIC chips typically target only one or a few specific tasks, limiting their flexible application in large volumes. On the other hand, FPGA devices are capable of implementing flexible and reconfigurable neural network algorithms, and hence, they have the potential to be used flexibly in various applications. In addition, FPGA-based neural network accelerators also provide parallel computation capability with low power consumption [14,18]. Therefore, the FPGA has been considered as a promising candidate for neural network accelerator designs.

Previous FPGA-based neural network hardware accelerators have mainly targeted relatively large datasets, which require considerably high resource usage and lead to proportionally long processing latency. To address the long processing delay, several optimization methods for accelerating neural network computation on FPGAs have been studied [13], and the dataflow optimization method is considered one of the most promising [19,20]. Unrolling the loops of the convolution computations in CNNs (and BCNNs) [21] and pipelining their computations have been widely adopted in image recognition applications to reduce the latency [22]. However, this approach requires a considerably large amount of hardware resources in the FPGA device to maximize parallel computation, and it also leads to very high power consumption. Therefore, it is not practical for applications requiring low power consumption, a small amount of hardware resources (i.e., low cost) and low processing latency, such as mm-wave RoF systems, especially at base stations or remote access points.

In this paper, to the best of our knowledge, we propose and demonstrate FPGA-based neural network hardware accelerators for mm-wave RoF system applications for the first time. Both CNN and BCNN FPGA hardware accelerators are investigated experimentally. To meet the low-latency, low-power-consumption and low-cost implementation requirements, a novel inner parallel hardware optimization method is also proposed. With the proposed optimization method, the neural network hardware accelerator can be implemented on compact FPGA platforms with limited hardware resources and low power consumption. Experimental results show that the signal processing latency is reduced by about 44.93% and 45.85% for CNN and BCNN, respectively, compared with the FPGA-based neural network accelerators without the optimization method. Compared with the popular ARM Cortex A9 embedded processor, the proposed optimization method achieves latency reductions of 99.45% and 92.79% for CNN and BCNN, respectively. Results also show that the BER performance of the mm-wave RoF system with the proposed FPGA-based hardware accelerators is similar to that of the system with neural networks implemented using high-end GPUs, whilst the cost and power consumption are significantly reduced. For CNN and BCNN, the realized power reductions are 86.91% and 86.14%, respectively. The latency of the CNN FPGA-based hardware accelerator is also substantially reduced, by 85.49%, compared with the GPU case. Therefore, the demonstrated FPGA-based CNN and BCNN hardware accelerators provide a promising solution for practical applications in mm-wave RoF systems. The major contributions of this paper are summarized as follows:

  • The FPGA-based CNN and BCNN hardware accelerators have been proposed and studied for mm-wave RoF systems for the first time.
  • The proposed CNN and BCNN FPGA hardware accelerators have been experimentally demonstrated in a 60 GHz RoF system. Compared with the results obtained using neural networks implemented by GPUs, similar BER performance has been achieved, whilst the power consumption is significantly reduced. Much lower latency is also achieved with the CNN FPGA hardware accelerator.
  • A novel optimization method based on inner parallel computation has been proposed, enabling lower latency and the implementation of CNN using a compact FPGA platform to achieve lower power consumption and lower cost. The hardware friendly Leaky-ReLU function [23] has been utilized to further reduce the number of logic resources required and the power consumption. With the proposed optimization method, the neural network hardware accelerator has been realized using low-cost compact FPGAs without any BER performance degradation.

2. FPGA-based CNN and BCNN hardware accelerators for mm-wave RoF systems

2.1 mm-wave RoF system with CNN and BCNN decision schemes

Neural network based decision schemes can be used to suppress various impairments and to achieve improved BER performance in mm-wave RoF systems [12,24]. Compared with the FC-NN scheme, it has been shown that the CNN and BCNN schemes can achieve slightly better BER performance with reduced computation cost [24]. Therefore, in this paper we study FPGA-based neural network hardware accelerators based on the CNN and BCNN architectures. The general architecture of the mm-wave RoF system considered is shown in Fig. 1, where the double-sideband carrier-suppression (DSB-CS) modulation scheme is utilized. The CNN or BCNN based decision scheme is implemented at the receiver side. As discussed in the previous section, the neural networks in previous studies have been implemented using high-end GPUs, which is not practical for RoF system applications due to cost and power consumption considerations. In addition, the latency of the neural networks, which is critical for wireless communications, has not been considered in previous studies.


Fig. 1. The architecture of the 60 GHz mm-wave RoF system with CNN- and BCNN-based decision schemes. DFB: distributed feedback laser; PC: polarization controller; MZM: Mach-Zehnder modulator; EDFA: erbium-doped fiber amplifier; BPF: bandpass filter; SMF: single-mode fiber; PD: photo-detector; AMP: amplifier; LO: local oscillator; and DSO: digital sampling oscilloscope.


To facilitate the description and discussion of the proposed FPGA-based neural network hardware accelerators, we briefly describe the working principles and architectures of the CNN and BCNN based decision schemes here. More details are available in [24]. The CNN decision scheme is shown in Fig. 2(a), and it consists of the input layer, 2 convolutional layers, and the output layer. Our CNN decision scheme carries out 1-D convolution operations, followed by 1-D maxpooling and the Leaky-ReLU nonlinear activation function. The output layer is a fully-connected layer that computes multiplications and additions to generate the final symbol decision. The 1-D convolution computation in the CNN decision scheme can be expressed as [24]

$$Conv^{n} = B^{n} + \sum_{1}^{N}\sum_{1}^{R}\sum_{1}^{F} X^{n} \otimes K^{n}$$
where B is the bias parameter, X is the input data of each convolutional layer, K is the kernel set, Conv is the outcome of the convolution computation, n is the layer number, N is the number of kernel sets, R is the data size, and F is the kernel size. By conducting the convolution computation, the signal characteristics carried by the received symbols can be effectively learned, and impairments and distortions, such as inter-symbol interference (ISI), can be suppressed.
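To make the mapping from Eq. (1) to hardware more concrete, a minimal C sketch of a single 1-D convolutional layer is given below. It interprets Eq. (1) as the usual sliding-window convolution with one output map per kernel set; the sizes N, R and F are illustrative placeholders rather than the values used in the experiments.

/* Minimal C sketch of the 1-D convolution of Eq. (1) for one layer.
 * N, R and F are illustrative placeholders, not the experimental values. */
#define N 4                 /* number of kernel sets */
#define R 16                /* output data size */
#define F 3                 /* kernel size */

void conv1d_layer(const float x[R + F - 1], const float k[N][F],
                  const float bias[N], float conv[N][R])
{
    for (int n = 0; n < N; n++)              /* one output map per kernel set */
        for (int r = 0; r < R; r++) {        /* slide the window over the input */
            float acc = bias[n];
            for (int f = 0; f < F; f++)
                acc += x[r + f] * k[n][f];   /* multiply-accumulate (fMUL/fADD) */
            conv[n][r] = acc;
        }
}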


Fig. 2. The architecture of neural network decision schemes for mm-wave RoF systems. (a) CNN; and (b) BCNN.


The BCNN based decision scheme is shown in Fig. 2(b), and it consists of the input layer, 3 convolutional layers, and the fully-connected output layer. The major difference from the CNN is that the convolution computation is implemented using only the most-significant bit (MSB), i.e., the sign bit, in the multiplications and additions, which can be expressed as [24]

$$Binary\text{-}Conv^{n} = B^{n} + \sum_{1}^{N}\sum_{1}^{R}\sum_{1}^{F} MSB\left(X^{n}\right) \times MSB\left(K^{n}\right)$$
where the notations for variables are the same as the CNN case above.
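As a hedged illustration of how the sign-bit arithmetic of Eq. (2) can be expressed in C, the sketch below binarizes both operands to their IEEE-754 sign bits and replaces each multiplication with an XNOR-style comparison accumulated as +1/-1. The sizes and the exact binarization convention are assumptions for illustration only; in the actual accelerator the first layer keeps floating-point inputs, as discussed in Section 2.2.

#include <stdint.h>
#include <string.h>

#define N 4                 /* number of kernel sets (illustrative) */
#define R 16                /* output data size (illustrative) */
#define F 3                 /* kernel size (illustrative) */

/* Sign bit (MSB) of a 32-bit IEEE-754 float: 1 for negative, 0 otherwise. */
static inline uint32_t msb(float v)
{
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    return bits >> 31;
}

/* Binary convolution of Eq. (2): the product of two sign-bit values is +1
 * when the bits agree and -1 when they differ, i.e. an XNOR followed by an
 * accumulation, which maps to logic gates instead of DSP multipliers. */
void bconv1d_layer(const float x[R + F - 1], const float k[N][F],
                   const float bias[N], float conv[N][R])
{
    for (int n = 0; n < N; n++)
        for (int r = 0; r < R; r++) {
            int acc = 0;
            for (int f = 0; f < F; f++)
                acc += (msb(x[r + f]) == msb(k[n][f])) ? 1 : -1;
            conv[n][r] = bias[n] + (float)acc;
        }
}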

For both CNN and BCNN based decision schemes, in their convolutional layers, the Leaky-ReLU is used as the nonlinear activation function, which can be expressed as:

$${f(x)}=\begin{cases} 0.25\cdot x, & x<0\\ x, & x \geq 0 \end{cases}.$$

The Leaky-ReLU activation function is selected in the CNN and BCNN decision schemes since it achieves the best accuracy and avoids the gradient vanishing problem during the training process [25]. It is also selected due to its potential to reduce the required hardware resources, as will be discussed later in Section 2.3.

The training of both the CNN and the BCNN was conducted in software on a computer. The TensorFlow framework was used with a learning rate of 0.0005. The batch size was experimentally optimized and selected as 1024. The Adam optimization algorithm was used for training [26]. 50% of the received data was used for training and the other 50% was used for testing. We combined the received data at different power levels (with the same fiber length) to form a better generalized training dataset.

The overfitting issue is critical and may result in an over-estimation of the neural network capability. To avoid overfitting, we used three methods in the measurement. Firstly, the training dataset consisted of received symbols from various received optical power levels, and hence the dataset is better generalized [27]; secondly, truly random data was transmitted instead of PRBS data, which avoids the possible learning of the data pattern that can occur with PRBS patterns; and thirdly, the architectures of the proposed neural networks are relatively simple (i.e., the input sequence length is short), which also helps to avoid overfitting.

2.2 FPGA-based neural network hardware accelerator for mm-wave RoF systems

Although neural networks have shown great capability in overcoming the limitations of mm-wave RoF systems, they require considerably high computation cost and induce relatively long latency. In addition, high performance computing platforms equipped with GPUs are widely used to implement neural networks, which also results in high power consumption and high cost. To solve these limitations, here we propose and study FPGA-based CNN and BCNN hardware accelerators for the mm-wave RoF system. We focus on CNN and BCNN hardware accelerators, since they have shown better capability and lower computation cost than the FC-NN in RoF systems in our previous study [24].

The overall architecture of the FPGA-based CNN and BCNN hardware accelerators for mm-wave RoF systems is depicted in Fig. 3. The neural network hardware accelerator mainly consists of the micro-controller, direct memory access (DMA), off-chip memory, on-chip memory controller, on-chip memory, the CNN/BCNN decision scheme IP for RoF systems, and the AXI bus system. The on-chip memory, which is typically distributed block RAM (BRAM), is used to store the weight and bias parameters of the neural network decision scheme, and the off-chip memory is used to store the received signal (i.e., the dataset) of the mm-wave RoF system. The received signal stored in the external memory is transferred to the implemented CNN- or BCNN-based decision scheme via the AXI bus system under the control of the DMA, which is an IP for controlling data flow from the external memory to the on-chip memory. The received signal of the mm-wave RoF system is then processed with the weight and bias parameters stored in the on-chip memory during the inference period. The sequence and flow of data and neural network parameters are controlled by the soft-IP micro-controller, which can be programmed in C or with other FPGA Software Development Kit (SDK) tools. We use the Timer and universal asynchronous receiver-transmitter (UART) blocks to measure the latency and to report the decision results from the CNN or BCNN decision scheme, respectively.
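To illustrate this data flow, the following C sketch shows, at a very high level, what the soft-IP micro-controller firmware might look like. All function and constant names (dma_transfer, cnn_ip_run, timer_start, timer_stop, uart_report, SYMBOL_WINDOW) are hypothetical placeholders introduced for illustration and are not actual Xilinx SDK calls; a real design would use the vendor-provided DMA, timer and UART driver APIs.

#define SYMBOL_WINDOW 16    /* samples per decision window (hypothetical) */

/* Hypothetical hardware-access routines standing in for vendor drivers. */
extern void     dma_transfer(const float *src, float *dst, int len);
extern int      cnn_ip_run(const float *window);
extern void     timer_start(void);
extern unsigned timer_stop(void);
extern void     uart_report(int decision, unsigned cycles);

void process_received_signal(const float *ddr_rx, float *bram_in, int num_symbols)
{
    for (int s = 0; s < num_symbols; s++) {
        /* 1. DMA one received-symbol window from off-chip DDR3 to on-chip BRAM. */
        dma_transfer(ddr_rx + s * SYMBOL_WINDOW, bram_in, SYMBOL_WINDOW);

        /* 2. Run the CNN/BCNN decision IP; weights and biases already sit in BRAM. */
        timer_start();
        int decision = cnn_ip_run(bram_in);
        unsigned cycles = timer_stop();      /* latency measurement via the Timer block */

        /* 3. Report the symbol decision and measured latency over UART. */
        uart_report(decision, cycles);
    }
}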


Fig. 3. The overall architecture of FPGA-based CNN or BCNN hardware accelerator for mm-wave RoF systems.


As discussed in the previous section, low latency, low hardware resource requirements and low power consumption are highly desired in the FPGA-based neural network accelerators for mm-wave RoF systems. To realize these requirements, the CNN and BCNN based decision schemes are implemented in 3 different hardware accelerator architectures, which are shown in Fig. 4. Figure 4(a) depicts CNN1 and CNN2, which are implemented with the non-optimized method and the fully unrolled and pipelining optimization method (widely used in image recognition applications) [21,22], respectively. In both the CNN1 and CNN2 architectures, the convolutional layers are implemented sequentially. Thus, the multiplications and additions are followed by the maxpooling and Leaky-ReLU operations in each convolutional layer, before the operations in the next convolutional layer can be executed. After all convolutional layers, the multiplication and addition operations in the fully-connected output layer are executed to generate the final symbol decisions. The notations N and M in Fig. 4(a) represent the number of parallel computation units in the convolutional layers. For CNN1, both N and M are equal to 1; for CNN2, they are larger than 1 and are determined by the synthesis strategy and the available resources. Comparing the CNN1 and CNN2 architectures, it can be seen that CNN2 uses parallel computations, and hence, it is capable of reducing the processing delay caused by the neural network, which is important for mm-wave RoF applications. However, as will be discussed later, the FPGA implementation of CNN2 requires a significantly larger amount of hardware resources and leads to higher power consumption, which limits its practical application.
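For reference, the sketch below shows how the fully unrolled and pipelining optimization of CNN2 is typically requested in Vivado HLS: the inner multiply-accumulate loop is unrolled and the output loop is pipelined, so the tool instantiates multiple DSP-based units in parallel. The exact directives used in this work are not stated in the text, so the pragma placement and sizes are illustrative assumptions.

#define N 4                 /* number of kernel sets (illustrative) */
#define R 16                /* output data size (illustrative) */
#define F 3                 /* kernel size (illustrative) */

void conv1d_cnn2(const float x[R + F - 1], const float k[N][F],
                 const float bias[N], float conv[N][R])
{
    for (int n = 0; n < N; n++) {
        for (int r = 0; r < R; r++) {
#pragma HLS PIPELINE II=1          /* overlap successive output computations */
            float acc = bias[n];
            for (int f = 0; f < F; f++) {
#pragma HLS UNROLL                 /* replicate the multiply-add (DSP) hardware */
                acc += x[r + f] * k[n][f];
            }
            conv[n][r] = acc;
        }
    }
}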


Fig. 4. The architecture of FPGA-based CNN and BCNN hardware accelerators. (a) CNN1 and CNN2 architectures; (b) CNN3 architecture with the proposed optimization method; (c) BCNN1 and BCNN2 architectures; and (d) BCNN3 architecture with the proposed optimization method.


Figure 4(c) illustrates the BCNN1 and BCNN2 architectures, which are similarly implemented with the non-optimized method and the fully unrolled and pipelining optimization method [21,22], respectively. Similar to the CNN case, BCNN1 and BCNN2 also execute the convolutional layers sequentially, and the major difference is that the convolutional layers are binarized. The notations J, K and L also represent the numbers of parallel computation units, which are set to 1 in BCNN1 and determined by the synthesis strategy and the available FPGA resources in BCNN2. Similar to the CNN hardware accelerator case, the BCNN2 architecture uses parallel computation to reduce the processing latency, at the cost of higher power consumption and a larger amount of hardware resources.

Due to the different data types used during computations, the dominant computation units required in the CNN and BCNN hardware accelerators are different. In FPGA-based CNN hardware accelerators, 32-bit floating point numbers in the IEEE-754 standard [28] are used for computations, and hence, as shown in Fig. 4(a), the DSP IP blocks are used for their hardware implementations. More specifically, the multiplications (fMUL) and additions (fADD) in the convolution computations described by Eq. (1) are implemented with DSP IP blocks. The DSP IP blocks are also needed in the following 1-D maxpooling and Leaky-ReLU operations in the convolutional layers, since they require comparison (fCMP) and multiplication (fMUL) computations, which also use 32-bit floating point numbers. The fully-connected output layer requires DSP IP blocks as well, since the output layer also requires multiplications and additions using 32-bit floating point numbers. Therefore, a large number of DSP IP blocks are required in the FPGA-based CNN hardware accelerators.

On the other hand, in the BCNN hardware accelerators, binary convolutional layers are mainly used, except for the first layer (which processes the received data). Binary numbers in FPGAs are normally handled with logic gates instead of DSP IP blocks. Therefore, in the BCNN hardware accelerators shown in Fig. 4(c), the first convolutional layer is implemented with DSP IP blocks, since 32-bit floating point numbers (i.e., the input data) are processed in this layer to achieve high accuracy. The following convolutional layers, which execute the binary convolution computation using the most significant bit (MSB) as described in Eq. (2), are realized with XNOR logic gates. Since the output layer also processes binary numbers, logic gates are utilized there as well. Therefore, logic gates in FPGAs are mainly used in the BCNN hardware accelerator implementations.

In addition, as shown by Eq. (1) and Eq. (2), the convolution and binary convolution operations require a large number of multiplications, additions and subtractions (for BCNN only). Therefore, when the fully unrolled and pipelining optimization method is adopted to reduce the processing latency, a considerably large amount of hardware resources is required. Specifically, a very large number of DSP IP blocks are needed for the CNN2 architecture, and a very large number of logic gates are required for the BCNN2 architecture. These hardware requirements also result in high power consumption in the CNN2 and BCNN2 FPGA implementations.

2.3 Inner parallel computation optimization for FPGA-based CNN and BCNN hardware accelerators in mm-wave RoF systems

Algorithm 1. The proposed inner parallel computation optimization method.

As discussed in the previous section, the un-optimized FPGA-based CNN and BCNN hardware accelerators do not support parallel computation, which results in relatively long latency, whilst the ones with the fully unrolled and pipelining optimization method require a large amount of hardware resources and have high power consumption. Since the unrolled and pipelining optimization method maximizes the parallel computation capability in CNN2, repeated multiplication and addition units are instantiated in the convolutional layers. Because the CNN architecture uses 32-bit IEEE-754 floating point numbers, the multiplication and addition operations in the convolutional layers are implemented with the DSP blocks in the FPGA. Therefore, the unrolling parameters in CNN2 are determined by the available DSP blocks and the synthesis strategies. As expressed by Eq. (1), these parameters are mainly decided by the number of kernel sets, the input data length and the kernel size. The degree of parallel computation in CNN2 can be seen from the number of DSP blocks used, as shown in Table 2. To solve these limitations and to meet the requirements of mm-wave RoF systems, which demand low latency, low computation hardware requirements and low power consumption, here we propose the inner parallel computation optimization method and the use of the hardware-friendly Leaky-ReLU nonlinear activation function [23] for the FPGA-based CNN and BCNN hardware accelerators. We refer to them as CNN3 and BCNN3, and their architectures are illustrated in Fig. 4(b) and Fig. 4(d), respectively. The proposed optimization method is described in detail in Algorithm 1. In the proposed optimization method, the convolution operation, which consists of multiplications (fMUL), additions (fADD) and subtractions (fSUB), is computed in parallel from both the start and the end of the input data stream, which is stored in the on-chip memory with known addresses. After the convolution computation, the nonlinear activation functions and the maxpooling are also computed in parallel. Because the proposed method computes in parallel from both ends of the data stored in on-chip memory, the processing latency is improved compared with the non-optimized implementations (i.e., CNN1 and BCNN1). In addition, compared with CNN2 and BCNN2, the proposed optimization method requires fewer hardware resources and has lower power consumption.
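A minimal C sketch of one possible reading of the proposed inner parallel computation (Algorithm 1) is given below: within each kernel set, one multiply-accumulate unit walks forward from the start of the on-chip input buffer while a second unit walks backward from the end, so the two halves of the output are produced simultaneously. The sizes are placeholders and R is assumed even for simplicity.

#define N 4                 /* number of kernel sets (illustrative) */
#define R 16                /* output data size (illustrative, assumed even) */
#define F 3                 /* kernel size (illustrative) */

void conv1d_inner_parallel(const float x[R + F - 1], const float k[N][F],
                           const float bias[N], float conv[N][R])
{
    for (int n = 0; n < N; n++) {
        for (int r = 0; r < R / 2; r++) {
#pragma HLS PIPELINE
            int rf = r;              /* index walking forward from the start */
            int rb = R - 1 - r;      /* index walking backward from the end  */
            float acc_f = bias[n];
            float acc_b = bias[n];
            for (int f = 0; f < F; f++) {
                acc_f += x[rf + f] * k[n][f];   /* front-side compute unit */
                acc_b += x[rb + f] * k[n][f];   /* back-side compute unit  */
            }
            conv[n][rf] = acc_f;
            conv[n][rb] = acc_b;
        }
    }
}

Because only two compute paths are active per kernel set in this sketch, the extra hardware stays modest, which is consistent with the roughly two-fold DSP increase of CNN3 over CNN1 reported in Table 2.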

The hardware resources required in the proposed CNN3 and BCNN3 hardware accelerators can be further reduced by using the Leaky-ReLU nonlinear function. This is because the Leaky-ReLU function can be realized with an arithmetic right shift operation, which can be synthesized with logic gates in FPGAs instead of DSP IP blocks [23]. Compared with a Leaky-ReLU function implemented with DSP IP blocks, the Leaky-ReLU realized with the arithmetic right shift requires fewer hardware resources, and hence, the computation cost and power consumption are further reduced. The use of DSP IP blocks and logic gates in the Leaky-ReLU function can also be reduced by optimizing the coefficient applied to the negative convolution results in Eq. (3) [23]. We experimentally optimized this coefficient, and it was selected as 0.25 to minimize the use of hardware resources whilst maintaining the BER performance.
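As an illustration of the hardware-friendly Leaky-ReLU, the sketch below applies the coefficient 0.25 to negative values with an arithmetic right shift of two bits instead of a multiplication. It assumes a fixed-point (integer) representation of the activation, which is the case where the shift maps most directly to plain logic rather than a DSP block; the Q-format is an assumption for illustration.

#include <stdint.h>

/* Hardware-friendly Leaky-ReLU of Eq. (3): 0.25 * x is replaced by x >> 2
 * for negative inputs. On two's-complement targets the right shift of a
 * negative value is the arithmetic shift, so the sign is preserved. The
 * fixed-point format of x is an illustrative assumption. */
static inline int32_t leaky_relu_shift(int32_t x)
{
    return (x >= 0) ? x : (x >> 2);
}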

2.4 Experimental setup

The FPGA-based CNN and BCNN hardware accelerators for mm-wave RoF systems were experimentally demonstrated using the setup illustrated in Fig. 1. The 60 GHz frequency was used, and the DSB-CS scheme was adopted at the transmitter side, where a DFB laser at 1550 nm served as the light source and a dual-drive Mach-Zehnder modulator (MZM) driven by two complementary 30 GHz RF signals achieved the optical carrier suppression. The 5 Gb/s data signal was then modulated using another MZM. Truly random data generated by the avalanche effect was transmitted, and the modulation format was NRZ-OOK due to its simplicity. The modulated optical wave was then amplified by an Erbium-doped fiber amplifier (EDFA), filtered by an optical bandpass filter (BPF), transmitted via the single-mode fiber (SMF), and detected with a high-speed PIN photo-detector (PD). Due to device limitations, the wireless propagation part was not included in the experiment, and the converted electrical signal was down-converted directly by an RF mixer. To suppress the impairments in the system and to improve the BER performance, the detected signal was then processed by the CNN or BCNN based decision scheme. A high-speed digital sampling oscilloscope (DSO) was used before the CNN or BCNN decision scheme to serve as the analog-to-digital converter (ADC). In the experiment, the CNN and BCNN decision schemes were implemented using both the GPU and the FPGA hardware accelerators proposed and discussed in the previous section.

The FPGA-based CNN and BCNN hardware accelerators for mm-wave RoF systems were realized using two FPGA platforms, the Xilinx VC709 and the Xilinx Arty-7, and their specifications are compared in Table 1, including the numbers of BRAMs, DSP IP blocks, flip-flops (FFs) and look-up tables (LUTs), which are the fundamental reconfigurable resources in an FPGA. It is clear from the table that the Xilinx VC709 has a significantly larger number of resources available, whilst the Xilinx Arty-7 is more resource-limited. We selected these two FPGA platforms to show that the proposed neural network hardware accelerators for mm-wave RoF systems can be implemented on both high-end and compact FPGA platforms, which satisfy different application scenarios, such as base stations and embedded devices.


Table 1. FPGA specifications

More specifically, the VC709 platform has a Virtex-7 FPGA device with 4 GB of external DDR3 synchronous dynamic random access memory (SDRAM), and the Arty-7 board has a 7-series FPGA device with 256 MB of DDR3 SDRAM off-chip memory. As discussed in Section 2.2, the off-chip DDR3 SDRAMs were used to store the measured datasets (i.e., the received signal of the mm-wave RoF system). Xilinx Vivado High Level Synthesis (HLS, version 2017.3) was used to synthesize the CNN and BCNN designs, written in C, into hardware description language (HDL). The Vivado Design Suite was then used to implement the CNN and BCNN decision schemes for the 60 GHz mm-wave RoF system on the two FPGA platforms. The operating clock speeds for the VC709 platform and the Arty-7 platform were 100 MHz and 83 MHz, respectively.

2.5 Results and discussions

To demonstrate the feasibility of the proposed FPGA-based CNN and BCNN hardware accelerators in the 60 GHz mm-wave RoF system, we first measured the BER performance at different fiber transmission distances. As the comparison benchmark, we also processed the received signal with the CNN and BCNN decision schemes implemented using the GPU (nVidia M5000M). The results are shown in Fig. 5. It can be seen that for fiber transmission distances of up to 20 km, BER performance within the forward-error correction (FEC) limit can be achieved in the mm-wave RoF system with the FPGA-based CNN or BCNN hardware accelerator. It is also clear that similar BER performance is achieved for all tested transmission distances whether the neural networks are implemented with the GPU or the FPGA, confirming the capability of the proposed FPGA-based CNN and BCNN hardware accelerators in RoF systems. The BER curves in Fig. 5 are not very smooth, although each point is the average of 40 tests. The major reasons are the randomness of the neural network initialization parameters and of the stochastic gradient descent based learning process. All three CNN and all three BCNN FPGA-based hardware accelerator architectures shown in Fig. 4 were implemented. All three CNN FPGA hardware accelerator models (i.e., CNN1-CNN3; the same applies to the BCNN models) have the same CNN structure shown in Fig. 2, and hence they achieve the same BER performance. The major difference amongst CNN1-CNN3 (and amongst BCNN1-BCNN3) is the optimization method applied. In addition, compared with the FPGA-based BCNN hardware accelerator, better BER performance is realized with the CNN implementation for all fiber transmission distances. The worse BER performance of the BCNN hardware accelerator is mainly due to the reduced bit width of the variables (i.e., binary values) and the additional accuracy loss during binarization.


Fig. 5. Experimental results on the BER performance of the 60 GHz mm-wave RoF system. (a) fiber length = 10 km; (b) fiber length = 15 km; and (c) fiber length = 20 km.


The hardware resource requirements and the processing latency of the FPGA-based CNN hardware accelerators using the three architectures shown in Fig. 4(a) and Fig. 4(b) were also experimentally analyzed. The results are presented in Table 2 and Table 3, respectively. It can be seen that with the VC709 platform, the non-optimized CNN1 implementation requires 15 DSP IP blocks, 48 18Kb BRAMs, and 40.1K LUTs. To implement the CNN2 architecture with the fully unrolled and pipelining optimization, 23.6 times more DSP IP blocks and 2.73 times more LUTs are required. The significantly larger number of hardware resources used in the CNN2 implementation enables about 6.95 times faster processing (i.e., the latency is reduced by 85.62%). On the other hand, 30 DSP IP blocks, 51 18Kb BRAMs, and 43.1K LUTs are needed to implement the CNN3 architecture with the proposed inner parallel computation optimization method. Compared with the CNN1 implementation, twice as many DSP IP blocks and about 10% more 18Kb BRAMs and LUTs are needed. The slightly larger number of hardware resources needed in CNN3 realizes 1.81 times faster processing (i.e., the latency is reduced by 44.93%).


Table 2. Resource utilization of FPGA-based CNN hardware accelerators


Table 3. Performance comparison of FPGA-based CNN hardware accelerators

In addition to the implementation on the VC709 platform, which is a high-end FPGA platform, it is also highly desirable to implement the FPGA-based CNN hardware accelerators using more compact and lower-cost FPGA platforms for practical mm-wave RoF system applications. To satisfy this need, we also implemented the hardware accelerators using the compact and resource-limited Arty-7 platform. Due to the large number of hardware resources needed, CNN2 cannot be implemented on the compact FPGA platform, limiting its practical applications. On the other hand, CNN1 and CNN3 can be realized, and the results are also shown in Table 2 and Table 3. It is clear from the results that, compared with the VC709 implementations, both the CNN1 and CNN3 implementations on the Arty-7 require a smaller amount of hardware resources. This is mainly due to the difference between the DMAs in the VC709 and Arty-7 platforms. Regarding the latency performance, on the Arty-7 platform the CNN3 implementation with the proposed inner parallel optimization method achieves a latency reduction of about 60.32% compared with the un-optimized CNN1 implementation.

In addition to the hardware resource requirements and the processing latency, the power consumption is also an important parameter for FPGA-based neural network hardware accelerators. Therefore, we also characterized the power consumption of the three CNN hardware accelerator architectures experimentally. The results are shown in Table 3. It can be seen that compared with the un-optimized CNN1, although the processing latency is reduced in CNN2, the power consumption is increased by more than 28.25% when implemented on the VC709 platform, due to the significantly larger number of hardware resources required. On the other hand, the CNN3 architecture with the proposed optimization method consumes less than 3% more power than the CNN1 baseline architecture, whilst the latency is reduced by about 44.92%. On the compact Arty-7 platform, compared with CNN1, the CNN3 implementation achieves about 60.32% latency reduction at the cost of a less than 2% increase in power consumption. Therefore, the FPGA-based CNN3 hardware accelerator with the proposed inner parallel optimization method can be implemented on compact and low-cost platforms and achieves significantly reduced latency with only slightly increased power consumption. Since, to the best of our knowledge, no FPGA-based neural network hardware accelerators have previously been demonstrated for RoF systems, it is challenging to compare the performance achieved here with prior work. In addition, since the input data processed here is the time series of received communication symbols, one-dimensional data is processed and one-dimensional convolutional computations are utilized, whereas most previous studies on FPGA CNN hardware accelerators focus on two-dimensional images and hence adopt two-dimensional convolutional computations. Therefore, it is difficult to compare the performance of the proposed inner parallel optimization with that of previously reported optimized FPGA-based neural network hardware accelerators.

In addition to the CNN, the FPGA-based BCNN hardware accelerators were also demonstrated using the three architectures shown in Fig. 4(c) and Fig. 4(d) with the VC709 platform. The experimental results are shown in Table 4 and Table 5. The non-optimized BCNN1, used as the comparison baseline, requires 5 DSP IP blocks, 84.5 18Kb BRAM blocks, and 28.8K LUTs. Similar to the CNN case, a significantly larger number of hardware resources is required to implement the BCNN architecture with the fully unrolled and pipelining optimization (i.e., BCNN2), whilst the implementation of the BCNN with the proposed inner parallel optimization method (i.e., BCNN3) only requires slightly more hardware resources. The processing latency and power consumption of the three BCNN FPGA hardware accelerators were also measured. The GPU power consumption was measured using the power monitoring tool provided by the nVidia driver during the testing period (i.e., when the GPU executed the CNN and BCNN based decision schemes). Other software that might utilize the GPU was closed during the measurements to reduce measurement error. From the results shown in Table 5, it is clear that although better latency is achieved, BCNN2 also consumes much higher power (51.78% higher) than the un-optimized BCNN1, in addition to requiring a larger number of hardware resources. On the other hand, BCNN3 with the proposed optimization method reduces the latency by about 45.85% compared with BCNN1, whilst the increase in power consumption is negligible (less than 1%).


Table 4. Resource utilization of FPGA-based BCNN hardware accelerators


Table 5. Performance comparison of FPGA-based BCNN hardware accelerators

Comparing the FPGA-based CNN and BCNN hardware accelerators, it can be seen that the CNN hardware accelerators mainly require DSP IP blocks to support floating point computations, whilst the BCNN hardware accelerators mostly need BRAMs and LUTs for logic gates. This is consistent with the discussion in Section 2.2. Because of this hardware requirement and the limited number of LUTs available on the Arty-7 FPGA platform, implementing BCNN1 and BCNN3 on the compact Arty-7 platform is currently not feasible. However, it is possible to implement BCNN1 and BCNN3 using other compact FPGA platforms with more LUTs, such as the XC7A50T FPGA platform. In addition to the difference in hardware resource requirements, the FPGA-based CNN and BCNN hardware accelerators also have different power consumption and latency performances. In general, as can be seen from Table 3 and Table 5, the CNN hardware accelerators achieve more than one order of magnitude better latency with comparable power consumption. Therefore, the CNN hardware accelerators are in general better suited for mm-wave RoF system applications.

From the results presented above, it is clear that the BCNN FPGA hardware accelerators perform worse than the CNN FPGA hardware accelerators. This is mainly because of the binarization functions and the XNOR operation in the first convolutional layer. These operations cause the BCNN to be implemented with a deeper architecture, and hence, its latency is worse than that of the CNN-based FPGA hardware accelerators. Due to the accuracy loss during binarization, the BER performance of the BCNN is also worse than that of the CNN. The BCNN-based FPGA hardware accelerators are mainly used as a comparison benchmark for the CNN in terms of hardware usage, latency and complexity. In addition, although the BCNN performance is relatively poor, it has a unique advantage that is important in some applications: lower DSP usage. Therefore, it can be implemented on FPGAs with limited DSP resources. Furthermore, the BER performance of the BCNN can be improved by adopting batch normalization [29].

From the results and discussions presented above, it can be concluded that for both the CNN and the BCNN, compared with the FPGA-based hardware accelerator with architecture 1, architectures 2 and 3 achieve improved latency at the cost of increased power consumption. To further compare the capability and efficiency of architectures 2 and 3, here we define an efficiency index, which is the latency improvement per unit increase in power consumption, and it can be expressed as:

$$\text{Efficiency Index} = \frac{\text{Latency improvement ratio}}{\text{Power increase ratio}}$$
where the latency improvement ratio and the power increase ratio are defined as the corresponding improvement or increase compared with those of the architecture 1 FPGA implementation.

Both the latency improvement ratio and the power increase ratio in the efficiency index are calculated as relative parameters based on the baseline model, which is the non-optimized model (i.e., CNN1 or BCNN1). The latency improvement ratio and the power increase ratio in Eq. (4) can be expressed as:

$$\text{Latency improvement ratio} = \frac{\lvert \text{Latency of the improved model} - \text{Latency of the baseline model} \rvert}{\text{Latency of the baseline model}}$$
$$\text{Power increase ratio} = \frac{\text{Power of the improved model} - \text{Power of the baseline model}}{\text{Power of the baseline model}}$$
As shown by the results presented above, the latency improvement of the FPGA-based hardware accelerator is realized by the optimization method, and the power consumption increase is caused by the additional hardware resources utilized, which is the cost of the optimization method. Thus, CNN1 (or BCNN1) is selected as the baseline since it does not have any optimization applied. By calculating the efficiency index in this manner, reasonably fair comparisons between optimization methods can be made. In our opinion, the efficiency index also provides a reasonable measure covering both the latency and the power consumption perspectives of FPGA-based neural network hardware accelerators. This is important for a large portion of the targeted telecommunication applications, where both low latency and low power consumption are needed to satisfy the real-time and quality-of-service requirements. In addition, since the efficiency index is a relative parameter referenced to a baseline implemented on the same FPGA, the impact of the FPGA hardware on the optimization method performance is largely avoided. It should also be noted that in practice, some applications have more stringent latency requirements while others are more concerned with power consumption. In these scenarios, the actual values of the latency or the power consumption should be used instead of the efficiency index.
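For clarity, the efficiency index of Eqs. (4)-(6) can be computed directly from measured latency and power figures, as in the short helper below; this is purely illustrative and embeds no values from Tables 3 and 5.

#include <math.h>

/* Efficiency index of Eq. (4), computed from Eqs. (5) and (6).
 * "base" refers to the non-optimized baseline (CNN1 or BCNN1). */
double efficiency_index(double latency_base, double latency_opt,
                        double power_base,   double power_opt)
{
    double latency_improvement = fabs(latency_opt - latency_base) / latency_base; /* Eq. (5) */
    double power_increase      = (power_opt - power_base) / power_base;           /* Eq. (6) */
    return latency_improvement / power_increase;
}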

The efficiency index is calculated for both the CNN and BCNN schemes, and the results are shown in Table 3 and Table 5. It is clear that the FPGA-based hardware accelerators with architecture 3 achieve a better efficiency index than those with architecture 2. The relative importance of latency and power consumption depends on the actual application; for mm-wave RoF applications, where a balanced performance between latency and power consumption is highly desirable, the proposed inner parallel optimization method is therefore better suited than the fully unrolled and pipelining method. In addition, the proposed optimization method also enables the implementation of the CNN hardware accelerator on compact and low-cost FPGA platforms, and hence, it facilitates applications in RoF base stations and embedded devices. Due to the lower power consumption, the efficiency index is even higher when the proposed inner parallel optimization method is implemented using the compact and resource-limited FPGA platform. However, architecture 2 is a better solution for latency-critical applications, such as ultra-reliable low latency communications (URLLC), where latency is regarded as more important than power consumption.

The performance of the proposed FPGA-based CNN and BCNN hardware accelerators for mm-wave RoF systems was also compared with that of a popular embedded processor, the ARM Cortex A9. The Cortex A9 is a reduced instruction set computer (RISC) processor that executes instructions and computations in a pipelined architecture, and it has been widely used to benchmark the performance and capability of FPGA-based neural network hardware accelerators [30,31]. The results are also shown in Table 3 and Table 5. It is clear that for both the CNN and BCNN based decision schemes, although the clock speed of the Cortex A9 is much faster than the clock speeds of the VC709 and Arty-7 FPGA platforms, the signal processing latency is significantly longer. This is due to the slower floating point computations and the less parallel architecture of the Cortex A9. Specifically, the FPGA-based CNN and BCNN hardware accelerators with the proposed optimization method (i.e., CNN3 and BCNN3) achieve processing latency reductions of 99.45% and 92.79%, respectively, at the cost of about 43.12% and 41.34% higher power consumption.

In addition, the latency and power consumption of the CNN and BCNN GPU implementations were also measured, and they are compared in Table 3 and Table 5. Compared with the GPU implementation, the latency of the CNN FPGA hardware accelerator with the proposed optimization method is reduced by 85.49%, together with an 86.91% reduction in power consumption. For the BCNN, although the latency of the FPGA hardware accelerator with the proposed optimization method is longer than that of the GPU implementation, the power consumption is reduced significantly, by 86.14%.

3. Conclusions

In this paper, we have studied and demonstrated FPGA-based CNN and BCNN hardware accelerators for mm-wave RoF systems, and a novel inner parallel computation optimization method has been proposed to further enhance the capabilities of the hardware accelerators. Experimental results have shown that CNN- and BCNN-based decision schemes implemented in FPGA hardware accelerators can achieve BER performance similar to that obtained using GPUs, and a BER within the forward-error-correction (FEC) limit can be achieved for fiber transmission distances of up to 20 km.

Three FPGA-based CNN and BCNN hardware accelerator architectures have been implemented and demonstrated. Results have shown that architecture 1 (i.e., CNN1 and BCNN1) with the non-optimized method requires the smallest number of hardware resources and has the lowest power consumption, whilst the latency is long, which is problematic for RoF applications. Architecture 2 (i.e., CNN2 and BCNN2) improves the latency considerably, whilst the applied optimization method requires a significantly larger amount of hardware resources and results in much higher power consumption. Architecture 3 (i.e., CNN3 and BCNN3) with the proposed inner parallel optimization method improves the latency substantially, whilst the increase in resources and power consumption is minimal. To compare the optimization methods, an efficiency index has been defined to measure the latency improvement per unit increase in power consumption. It has been shown that the inner parallel optimization achieves a better efficiency index, and hence, it is better suited for mm-wave RoF applications, which require both low power consumption and low latency simultaneously.

In addition, more general comparisons have been conducted by comparing the performance of the FPGA-based CNN and BCNN hardware accelerators with implementations on a popular embedded processor (ARM Cortex A9) and a GPU (nVidia M5000M). Results have shown that compared with the Cortex A9, the FPGA implementations with the proposed optimization method achieve processing latency reductions of 99.45% and 92.79% for CNN and BCNN, respectively, at the cost of moderately increased power consumption (about 43.12% and 41.34% for CNN and BCNN). In addition, compared with the GPU implementation, the power consumption of the CNN and BCNN FPGA-based hardware accelerators with the proposed inner parallel optimization method is reduced by about 86.91% and 86.14%, respectively, and the latency is also reduced by 85.49% for the CNN case. Therefore, the FPGA-based neural network hardware accelerators demonstrated in this paper provide a promising solution for mm-wave RoF systems.

In this work, the system studied uses a single-carrier modulation (SCM) format, which is also widely used in high-speed data centre communications and access networks. We selected the mm-wave RoF system to demonstrate the impairment suppression capability of the CNN and BCNN FPGA-based hardware accelerators, since this type of system typically suffers from severe impairments. It should be noted that in addition to SCM, OFDM and other multi-carrier modulation formats have also been widely used in 5G applications, and FPGA-based neural network accelerators for these formats need further investigation.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

References

1. Y. Tian, S. Song, K. Powell, K. Lee, C. Lim, A. Nirmalathas, and X. Yi, “A 60-ghz radio-over-fiber fronthaul using integrated microwave photonics filters,” IEEE Photonics Technol. Lett. 29(19), 1663–1666 (2017). [CrossRef]  

2. D. Kedar, D. Grace, and S. Arnon, “Laser nonlinearity effects on optical broadband backhaul communication links,” IEEE Trans. Aerosp. Electron. Syst. 46(4), 1797–1803 (2010). [CrossRef]  

3. A. M. J. Koonen and M. G. Larrodé, “Radio-over-mmf techniques—part ii: Microwave to millimeter-wave systems,” J. Lightwave Technol. 26(15), 2396–2408 (2008). [CrossRef]  

4. Z. Cao, J. Yu, M. Xia, Q. Tang, Y. Gao, W. Wang, and L. Chen, “Reduction of intersubcarrier interference and frequency-selective fading in ofdm-rof systems,” J. Lightwave Technol. 28(16), 2423–2429 (2010). [CrossRef]  

5. Z. Cao, J. Yu, H. Zhou, W. Wang, M. Xia, J. Wang, Q. Tang, and L. Chen, “Wdm-rof-pon architecture for flexible wireless and wire-line layout,” J. Opt. Commun. Netw. 2(2), 117–121 (2010). [CrossRef]  

6. S. J. Savory, “Digital coherent optical receivers: Algorithms and subsystems,” IEEE J. Sel. Top. Quantum Electron. 16(5), 1164–1179 (2010). [CrossRef]  

7. Y. Wang, L. Tao, X. Huang, J. Shi, and N. Chi, “Enhanced performance of a high-speed wdm cap64 vlc system employing volterra series-based nonlinear equalizer,” IEEE Photonics J. 7(3), 1–7 (2015). [CrossRef]  

8. Y. Cui, M. Zhang, D. Wang, S. Liu, Z. Li, and G.-K. Chang, “Bit-based support vector machine nonlinear detector for millimeter-wave radio-over-fiber mobile fronthaul systems,” Opt. Express 25(21), 26186–26197 (2017). [CrossRef]  

9. C.-Y. Chuang, L.-C. Liu, C.-C. Wei, J.-J. Liu, L. Henrickson, W.-J. Huang, C.-L. Wang, Y.-K. Chen, and J. Chen, “Convolutional neural network based nonlinear classifier for 112-gbps high speed optical link,” Opt. Fiber Commun. Conf. p. W2A.43 (2018).

10. J. He, J. Lee, T. Song, H. Li, S. Kandeepan, and K. Wang, “Recurrent neural network (rnn) for delay-tolerant repetition-coded (rc) indoor optical wireless communication systems,” Opt. Lett. 44(15), 3745–3748 (2019). [CrossRef]  

11. N. Chi, Y. Zhao, M. Shi, P. Zou, and X. Lu, “Gaussian kernel-aided deep neural network equalizer utilized in underwater pam8 visible light communication system,” Opt. Express 26(20), 26700–26712 (2018). [CrossRef]  

12. Z. Wan, J. Li, L. Shu, M. Luo, X. Li, S. Fu, and K. Xu, “Nonlinear equalization based on pruned artificial neural networks for 112-gb/s ssb-pam4 transmission over 80-km ssmf,” Opt. Express 26(8), 10631–10642 (2018). [CrossRef]  

13. T. Wang, C. Wang, X. Zhou, and H. Chen, “A survey of FPGA based deep learning accelerators: Challenges and opportunities,” CoRR abs/1901.04988 (2019).

14. E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh, “Can fpgas beat gpus in accelerating next-generation deep neural networks?” (ACM, 2017), FPGA ’17, pp. 5–14.

15. S. Lohani, E. M. Knutson, M. O’Donnell, S. D. Huver, and R. T. Glasser, “On the use of deep neural networks in optical communications,” Appl. Opt. 57(15), 4180–4190 (2018). [CrossRef]  

16. J. Zhao, Y. Sun, H. Zhu, Z. Zhu, J. E. Antonio-Lopez, R. A. Correa, S. Pang, and A. Schülzgen, “Deep-learning cell imaging through Anderson localizing optical fiber,” Adv. Photonics 1(06), 1–12 (2019). [CrossRef]  

17. A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” Neural Inf. Process. Syst. 25 (2012).

18. A. Shawahna, S. M. Sait, and A. El-Maleh, “Fpga-based accelerators of deep learning networks for learning and classification: A review,” CoRR abs/1901.00121 (2019).

19. H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, “Understanding reuse, performance, and hardware cost of dnn dataflow: A data-centric approach,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, (ACM, 2019), MICRO '52, p. 754–768.

20. Q. Sun, T. Chen, J. Miao, and B. Yu, “Power-driven dnn dataflow optimization on fpga,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), (2019), pp. 1–7.

21. Y. Ma, Y. Cao, S. Vrudhula, and J. sun Seo, “Optimizing loop operation and dataflow in fpga acceleration of deep convolutional neural networks,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, (ACM, 2017), pp. 45–54.

22. H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance fpga-based accelerator for large-scale convolutional neural networks,” in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), (2016), pp. 1–9.

23. K. Xu, X. Wang, and D. Wang, “A scalable opencl-based fpga accelerator for yolov2,” in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), (2019), p. 317.

24. J. Lee, J. He, Y. Wang, C. Fang, and K. Wang, “Experimental demonstration of millimeter-wave radio-over-fiber system with convolutional neural network (cnn) and binary convolutional neural network (bcnn),” arXiv preprint arXiv:2001.02018 (2020).

25. A. F. Agarap, “Deep learning using rectified linear units (relu),” CoRR abs/1803.08375 (2018).

26. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization”, arXiv preprint arXiv:1412.6980 (2014).

27. K. Wang, J. Yang, G. Shi, and Q. Wang, “An expanded training set based validation method to avoid overfitting for neural network classifier,” in 2008 Fourth International Conference on Natural Computation, (2008), pp. 83–87.

28. C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, (ACM, 2015), FPGA '15, p. 161–170.

29. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR abs/1502.03167 (2015).

30. L. Yang, Z. He, and D. Fan, “A fully onchip binarized convolutional neural network fpga impelmentation with accurate inference,” in Proceedings of the International Symposium on Low Power Electronics and Design, (ACM, 2018).

31. H. Yonekawa and H. Nakahara, “On-chip memory based binarized convolutional deep neural network applying batch normalization free technique on an fpga,” in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), (2017), pp. 98–105.
