
Theses

GPU/FPGA-based Hybrid Platform for Accelerating LLM Inference
Hyunjun Park
Master's Thesis, School of Electrical and Electronic Engineering, Yonsei University, February 2024.

This thesis proposes a hybrid platform that accelerates LLM inference by exploiting the characteristics of LLM operations and of two hardware platforms, GPU and FPGA. LLM inference can be divided into two stages, the summarization stage and the generation stage. The summarization stage extracts context information from the input tokens, and the generation stage computes the output tokens from that context. The summarization stage offers a high degree of parallelism because the input tokens can be processed in parallel. In contrast, the generation stage consists of sequential operations because each output token depends on the previous output token. Most current AI models run on NVIDIA GPUs because they perform well across a range of metrics. However, while the summarization stage, which is bottlenecked by computation, benefits greatly from the high parallelism of GPUs, the generation stage, which is bottlenecked by memory bandwidth, does not fully utilize their computational capabilities. Some studies attempt to address this at the hardware level. Notably, DFX [6] designed an FPGA-based architecture optimized for LLM operations that, under certain conditions, processes the generation stage faster than a GPU.

This work starts from that observation. We hypothesized that a hybrid environment using both a GPU and an FPGA can run faster than a conventional homogeneous system if the summarization stage is assigned to the GPU and the generation stage to the FPGA. We built the platform in five steps. First, we set up a system with an NVIDIA A10 GPU and a Xilinx U55C FPGA, and selected the FasterTransformer [5] and DFX [6] codebases to control the respective devices. Second, we partitioned the LLM operations into tasks to be executed on the GPU and on the FPGA, and analyzed which data must be transferred between the two devices. Third, we implemented reshaping and communication kernels to transfer intermediate data from the GPU to the FPGA. Fourth, we applied a latency hiding technique to reduce communication overhead. Finally, for scalability, we developed an API in C++ with a Python wrapper for the hybrid platform.

To verify the performance of the platform, we ran experiments with various numbers of input and output tokens. Compared with running on a single device, the hybrid platform was up to 1.56 times faster for large numbers of input and output tokens. Prior research on GPU-FPGA hybrid systems exists: Hype-training [12] and FARNN [13] aim to optimize the training of CNN and Transformer models, and Walther [15] and FleetRec [16] aim to optimize inference for CNNs and recommendation systems. All of them use GPU-FPGA heterogeneous hardware, but no prior research accelerated LLM inference using heterogeneous hardware, and this is where the significance of our work lies. In addition, we designed the hybrid platform for scalability: at the micro level it can be adapted to new models and options, and at the macro level it allows the research to be extended to multiple devices and servers. We anticipate that extending this research to data-center-scale cluster platforms will create significant value.
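
The stage split and the latency hiding step described in the abstract can be illustrated with a toy Python sketch. It is not the thesis implementation. The function names (`summarize_on_gpu`, `transfer_gpu_to_fpga`, `generate_on_fpga`, `hybrid_generate`, `serial_latency`, `overlapped_latency`) are hypothetical, the token arithmetic is only a placeholder for real model computation, and the per-layer overlap model is an assumption, since the abstract does not say how latency hiding is realized.

```python
from typing import List, Sequence, Tuple


def summarize_on_gpu(prompt: Sequence[int]) -> Tuple[List[int], int]:
    """Summarization stage (hypothetical stand-in for the GPU side).

    All prompt tokens are available up front, so a real implementation
    can process them in parallel. Here the "context" is just a copy of
    the prompt and the first output token is a placeholder value.
    """
    context = list(prompt)
    first_token = sum(context) % 100
    return context, first_token


def transfer_gpu_to_fpga(context: List[int]) -> List[int]:
    """Placeholder for reshaping and copying intermediate data to the FPGA."""
    return list(context)


def generate_on_fpga(context: List[int], first_token: int, n_new: int) -> List[int]:
    """Generation stage (hypothetical stand-in for the FPGA side).

    Each new token depends on the previous one, so this loop is
    inherently sequential.
    """
    out = [first_token]
    while len(out) < n_new:
        context.append(out[-1])
        out.append((out[-1] + len(context)) % 100)  # placeholder arithmetic
    return out


def hybrid_generate(prompt: Sequence[int], n_new: int) -> List[int]:
    """Run the summarization stage on one device and generation on the other."""
    if n_new < 1:
        return []
    context, first_token = summarize_on_gpu(prompt)
    fpga_context = transfer_gpu_to_fpga(context)
    return generate_on_fpga(fpga_context, first_token, n_new)


def serial_latency(compute: Sequence[float], transfer: Sequence[float]) -> float:
    """No latency hiding: every transfer waits until all compute has finished."""
    return sum(compute) + sum(transfer)


def overlapped_latency(compute: Sequence[float], transfer: Sequence[float]) -> float:
    """Assumed latency hiding model: layer i's data is sent while layer i+1 computes.

    A transfer starts once its layer has been computed and the link is free.
    """
    compute_done = 0.0
    transfer_done = 0.0
    for c, t in zip(compute, transfer):
        compute_done += c
        transfer_done = max(transfer_done, compute_done) + t
    return transfer_done


if __name__ == "__main__":
    print(hybrid_generate([1, 2, 3], n_new=3))
    print(serial_latency([2, 2, 2], [1, 1, 1]))
    print(overlapped_latency([2, 2, 2], [1, 1, 1]))
```

In this model, overlapping hides all but the final transfer whenever each transfer is shorter than the compute step that follows it. That is the general idea behind reducing communication overhead, not a measurement from the thesis.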