
Theses

Decoupling Scheduling and Storage Formats for Balanced Graph Processing on a GPU [abstract] (PDF)
Shinnung Jeong
Ph.D. Thesis, School of Electrical and Electronic Engineering, Yonsei University, February 2025.

A graph algorithm can be processed efficiently on GPUs only with a proper schedule and a proper storage format. Existing GPU graph processing frameworks try to find an optimal schedule and storage format for an algorithm via iterative search, but they fail to find the optimal configuration because their schedules and storage formats are tightly coupled in their processing models. This tight coupling also makes it difficult for developers to extend the tuning space, yet implementing every possible combination of schedule and storage format by hand would impose a substantial implementation burden. Clear decoupling is therefore essential to explore optimization combinations efficiently.

This dissertation aims to enlarge the tuning space by decoupling the three key components of graph processing on a GPU: the algorithm, the schedule, and the storage format. It begins by analyzing the characteristics of existing optimizations, presents a fundamental abstraction interface for each key component, and proposes a new processing model. Building on these abstractions, this research proposes GRAssembler, a new GPU graph processing framework that efficiently integrates the decoupled schedule, storage format, and algorithm without abstraction overhead, thereby enhancing the coverage, composability, extensibility, and modularity of GPU graph processing.
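The decoupling idea can be illustrated with a minimal software sketch. The names below (`CSRFormat`, `VertexSchedule`, `bfs`) are hypothetical and do not reflect GRAssembler's actual API; the point is only that the algorithm, the schedule, and the storage format are independent objects that a framework composes, so any one of them can be swapped without touching the others.

```python
class CSRFormat:
    """Storage format: decides how edges are laid out and iterated (here, CSR)."""
    def __init__(self, num_vertices, edges):
        # Count out-degrees, then take a prefix sum to build row offsets.
        self.offsets = [0] * (num_vertices + 1)
        for src, _ in edges:
            self.offsets[src + 1] += 1
        for v in range(num_vertices):
            self.offsets[v + 1] += self.offsets[v]
        # Scatter destination vertices into their rows.
        self.neighbors = [0] * len(edges)
        cursor = list(self.offsets)
        for src, dst in edges:
            self.neighbors[cursor[src]] = dst
            cursor[src] += 1

    def out_neighbors(self, v):
        return self.neighbors[self.offsets[v]:self.offsets[v + 1]]


class VertexSchedule:
    """Schedule: decides how active work is grouped and executed (here, per vertex)."""
    def run(self, fmt, frontier, edge_fn):
        next_frontier = set()
        for src in frontier:
            for dst in fmt.out_neighbors(src):
                if edge_fn(src, dst):
                    next_frontier.add(dst)
        return next_frontier


def bfs(fmt, schedule, root, num_vertices):
    """Algorithm: written once, oblivious to which schedule and format it runs on."""
    depth = [-1] * num_vertices
    depth[root] = 0
    frontier = {root}
    while frontier:
        def relax(src, dst):
            if depth[dst] == -1:
                depth[dst] = depth[src] + 1
                return True  # dst is newly discovered; keep it in the frontier
            return False
        frontier = schedule.run(fmt, frontier, relax)
    return depth
```

Because `bfs` talks to the schedule and format only through their small interfaces, a different schedule (e.g. edge-parallel) or a different format could be dropped in independently, which is what enlarges the tuning space multiplicatively.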

Furthermore, this research demonstrates that the decoupled abstractions not only integrate existing studies but also help to uncover new opportunities for performance improvement, focusing on workload imbalance in GPU graph processing. First, it shows that new optimizations can be derived by considering the operation of the abstraction interface: focusing on the storage format's role in deciding memory access patterns, this research proposes CR2, a new storage format aligned with the characteristics of both graphs and GPUs. Second, it shows that new hardware acceleration opportunities can be revealed by analyzing the abstract methods of existing studies: observing that existing schedules incur runtime overhead due to the lack of hardware support, this research proposes SparseWeaver, a lightweight GPU functional-unit microarchitecture that converts sparse operations into dense operations to accelerate scheduling. Both CR2 and SparseWeaver are designed around the behavior of the abstraction interface and enlarge the tuning space.
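SparseWeaver itself is a hardware unit, but the sparse-to-dense conversion it accelerates can be sketched in software. In the illustrative sketch below (hypothetical function, assuming a CSR layout with `offsets`/`neighbors` arrays), a sparse frontier, an irregular list of active vertices with widely varying degrees, is expanded into a dense, uniform stream of edge tasks, so that consecutive GPU lanes would each take one task and workload imbalance across lanes disappears.

```python
def expand_frontier(frontier, offsets, neighbors):
    """Turn a sparse vertex frontier into a dense list of (src, dst) edge tasks."""
    # Per-vertex degrees, then an exclusive prefix sum that assigns each
    # frontier vertex a contiguous slot range in the dense task stream.
    degrees = [offsets[v + 1] - offsets[v] for v in frontier]
    starts = [0]
    for d in degrees:
        starts.append(starts[-1] + d)
    # Scatter one task per edge into its precomputed slot.
    tasks = [None] * starts[-1]
    for i, v in enumerate(frontier):
        base = offsets[v]
        for j in range(degrees[i]):
            tasks[starts[i] + j] = (v, neighbors[base + j])
    return tasks
```

Done in software on a GPU, this conversion costs a prefix sum and a scatter every iteration; performing it in a small dedicated functional unit is what removes that runtime overhead from the schedule.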

Leveraging this efficient decoupling and integration, GRAssembler significantly expands the tuning space from 336 to 4,480 configurations, resulting in a 1.3 times speedup over the state-of-the-art GPU graph processing framework. CR2 achieves a 1.53 times performance improvement while reducing memory usage by 32.1% on average. SparseWeaver delivers a 2.49 times reduction in execution time compared to the existing approach, with a minimal area overhead of just 0.045%.