I. Introduction
Graph Convolutional Networks (GCNs) have become popular solutions for many cloud-based applications, such as e-commerce [1] and recommendation systems [2]. Most GCN applications, like recommendation systems, are deployed in the cloud. To achieve real-time performance, GCN acceleration has been studied on application-specific integrated circuits (ASICs) [3] and GPU platforms [4]. FPGAs in the cloud have become a promising alternative in terms of performance, energy efficiency, and flexibility. Deploying GCNs on cloud-based FPGAs, however, poses several challenges:

(1) Heterogeneity of the GCN workload: A GCN has two major computation kernels [5]: aggregation and transformation. The aggregation kernel traverses the graph and involves a large number of irregular memory accesses. The transformation kernel, on the other hand, performs regular neural network computation, such as a multilayer perceptron (MLP). A GCN accelerator therefore needs to efficiently utilize external memory bandwidth as well as achieve massive computation parallelism; a sketch of the two kernels follows this list.

(2) Time to market: While GCNs are widely used, their models evolve rapidly [6]-[8]. RTL-based accelerators [3], [9] are hard to adapt to new GCN models and require significant development effort. HLS-based kernel designs can easily adapt to evolving GCN models, but require careful optimization to achieve high performance.

(3) Architectural constraints: FPGAs offer massive on-chip resources and are thus well suited to GCN acceleration, which demands both high memory bandwidth and high computation parallelism. However, state-of-the-art FPGAs usually consist of multiple dies with limited inter-die wire connections, and on-chip resources such as memory ports, block RAMs, and DSPs are unevenly distributed across the dies [10]. As a result, placing a large GCN design on a state-of-the-art FPGA frequently causes place-and-route (PnR) failures and timing violations.
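To make the workload heterogeneity of challenge (1) concrete, below is a minimal C++ sketch of one GCN layer in the standard formulation, computing the aggregation A*H followed by the transformation (A*H)*W (the nonlinearity is omitted). All identifiers here (CSRGraph, aggregate, transform) are illustrative, not part of any cited design.

    #include <algorithm>
    #include <vector>

    // Graph stored in compressed sparse row (CSR) form.
    struct CSRGraph {
        std::vector<int> row_ptr;  // size num_nodes + 1
        std::vector<int> col_idx;  // neighbor indices for each node
    };

    // Aggregation kernel: for each node, sum the features of its neighbors.
    // The neighbor index u is data-dependent, so reads of feat are
    // irregular -- this kernel is memory-bandwidth-bound.
    void aggregate(const CSRGraph& g, const std::vector<float>& feat,
                   std::vector<float>& agg, int num_nodes, int dim) {
        std::fill(agg.begin(), agg.end(), 0.0f);
        for (int v = 0; v < num_nodes; ++v)
            for (int e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
                int u = g.col_idx[e];  // irregular, data-dependent gather
                for (int d = 0; d < dim; ++d)
                    agg[v * dim + d] += feat[u * dim + d];
            }
    }

    // Transformation kernel: dense matrix multiply with the layer weights
    // (MLP-like). Access patterns are regular, so this kernel maps
    // naturally onto massively parallel DSP arrays.
    void transform(const std::vector<float>& agg,
                   const std::vector<float>& weight,
                   std::vector<float>& out,
                   int num_nodes, int in_dim, int out_dim) {
        for (int v = 0; v < num_nodes; ++v)
            for (int j = 0; j < out_dim; ++j) {
                float acc = 0.0f;
                for (int d = 0; d < in_dim; ++d)
                    acc += agg[v * in_dim + d] * weight[d * out_dim + j];
                out[v * out_dim + j] = acc;
            }
    }

The contrast between the two loops is exactly the design tension named above: aggregate stresses the external memory system with data-dependent reads, while transform demands high computation parallelism, and an accelerator must serve both efficiently.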