
Refereed International Conference Publications

Accelerating LLMs using an Efficient GEMM library and Target-aware Optimizations on Real-world PIM Devices [abstract] (ACM)
Hyeoncheol Kim, Taehoon Kim, Taehyeong Park, Donghyeon Kim, Yongseung Yu, Hanjun Kim, and Yongjun Park
Proceedings of the 2025 International Symposium on Code Generation and Optimization (CGO), March 2025.

Real-time processing of deep learning models on conventional systems, such as CPUs and GPUs, is highly challenging due to memory bottlenecks. This is exacerbated in Large Language Models (LLMs), whose execution is dominated by General Matrix Multiplication (GEMM) operations, which are more memory-intensive than convolution operations. Processing-in-Memory (PIM), which provides high internal bandwidth, is a promising alternative for LLM serving. However, since current PIM systems do not fully replace traditional memory, data transfer between the host and PIM-side memory is unavoidable, and minimizing this transfer cost is crucial for serving LLMs efficiently on PIM. In this paper, we propose PIM-LLM, an end-to-end framework that accelerates LLMs using an efficient tiled GEMM library and several key target-aware optimizations on real-world PIM systems. We first propose PGEMMlib, which provides PIM-optimized tiling techniques that account for architecture-specific characteristics to minimize unnecessary data-transfer overhead and maximize parallelism. In addition, Tile-Selector uses an analytical model to explore optimized tiling parameters and techniques for different GEMM shapes and the available resources of the PIM system. To accelerate LLMs with PGEMMlib, we integrate it into the TVM deep learning compiler framework. We further optimize LLM execution by applying several key optimizations: build-time memory layout adjustment, PIM resource pooling, CPU/PIM cooperation support, and QKV generation fusion. Evaluation shows that PIM-LLM achieves significant performance gains of up to 45.75x over the TVM baseline for several well-known LLMs. We believe this work provides key insights for efficient LLM serving on real PIM devices.
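
To make the tile-selection idea concrete, below is a minimal, hypothetical Python sketch of how an analytical cost model might score candidate GEMM tile shapes for a PIM-like device, trading host-PIM transfer volume against bank-parallel compute. The bank count, buffer size, bandwidth, throughput numbers, and the cost formula itself are illustrative assumptions for this sketch only; they are not taken from the paper's Tile-Selector or PGEMMlib.

# A minimal, hypothetical sketch of analytical tile selection for a GEMM
# offloaded to a PIM-like device. All constants and the cost model below are
# illustrative assumptions, not PIM-LLM's actual Tile-Selector or PGEMMlib.
from itertools import product

PIM_BANKS = 16            # assumed number of PIM banks computing in parallel
BANK_FLOPS = 64e9         # assumed per-bank compute throughput (FLOP/s)
BANK_BUF_BYTES = 8192     # assumed per-bank working buffer for one tile set
HOST_PIM_BW = 25e9        # assumed host <-> PIM transfer bandwidth (bytes/s)
ELEM_BYTES = 2            # fp16 operands

def ceil_div(a, b):
    return -(-a // b)

def fits_bank(tm, tn, tk):
    # The A, B, and C tiles handled by one bank must fit in its buffer.
    return (tm * tk + tk * tn + tm * tn) * ELEM_BYTES <= BANK_BUF_BYTES

def tile_cost(M, N, K, tm, tn, tk):
    # Rough cost (seconds) of tiling an (M x K) x (K x N) GEMM:
    # host->PIM transfer of A/B tiles, compute spread across banks, and
    # PIM->host transfer of C tiles, modeled as fully serialized.
    tiles = ceil_div(M, tm) * ceil_div(N, tn) * ceil_div(K, tk)
    in_bytes = tiles * (tm * tk + tk * tn) * ELEM_BYTES
    out_bytes = ceil_div(M, tm) * ceil_div(N, tn) * tm * tn * ELEM_BYTES
    compute_s = (2 * M * N * K) / (PIM_BANKS * BANK_FLOPS)
    return (in_bytes + out_bytes) / HOST_PIM_BW + compute_s

def select_tiles(M, N, K, candidates=(8, 16, 32, 64)):
    # Exhaustively score candidate tile shapes and return the cheapest one.
    best = None
    for tm, tn, tk in product(candidates, repeat=3):
        if not fits_bank(tm, tn, tk):
            continue
        cost = tile_cost(M, N, K, tm, tn, tk)
        if best is None or cost < best[0]:
            best = (cost, (tm, tn, tk))
    return best

if __name__ == "__main__":
    # e.g. a skinny decode-phase GEMM: one token against a 4096 x 4096 weight
    print(select_tiles(M=1, N=4096, K=4096))

In this toy model the selected shape is simply the one that minimizes transferred bytes while still fitting a bank's buffer, which mirrors the abstract's stated goal of reducing host-PIM transfer cost; the real system additionally considers GEMM shape, available PIM resources, and multiple tiling techniques.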