Refereed International Journal Publications

Peak-Memory-aware Partitioning and Scheduling for Multi-tenant DNN Model Inference [abstract] (ScienceDirect)
As Deep Neural Networks (DNNs) are deployed in ever more applications, multiple DNN inference models increasingly run on a single GPU.
As model sizes grow, the simultaneous execution of multiple DNN models can exhaust GPU memory, leading to unexpected out-of-memory (OOM) errors.
To avoid OOM errors, existing systems schedule models at either model-level or layer-level granularity.
However, model-level scheduling uses memory inefficiently because it preallocates memory for each model's peak demand, and layer-level scheduling incurs high scheduling overhead because of its fine-grained scheduling units.
This work proposes a new peak-memory-aware DNN model partitioning compiler and scheduler, Quilt.
The Quilt compiler partitions a DNN model into multiple tasks based on their peak memory usage, and the Quilt scheduler orchestrates the tasks of multiple models without triggering OOM errors.
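The abstract does not detail the scheduling policy, but the core idea can be illustrated with a minimal sketch: admit a partitioned task only when its peak memory demand fits within the remaining GPU memory budget, blocking otherwise. The `Task` interface, the `peak_memory` attribute, and the reservation policy below are illustrative assumptions, not Quilt's actual implementation.

```python
import threading


class PeakMemoryScheduler:
    """Illustrative sketch: admit tasks by reserving their peak memory.

    Assumes each task reports its peak GPU memory demand in bytes
    (e.g., estimated by the compiler); the real Quilt policy may differ.
    """

    def __init__(self, gpu_memory_budget_bytes):
        self._free = gpu_memory_budget_bytes   # unreserved GPU memory
        self._cond = threading.Condition()

    def run(self, task):
        # Block until the task's peak demand can be reserved, so the
        # sum of reserved peaks never exceeds the GPU memory budget.
        with self._cond:
            while task.peak_memory > self._free:
                self._cond.wait()
            self._free -= task.peak_memory
        try:
            task.execute()                     # launch the task's kernels
        finally:
            with self._cond:
                self._free += task.peak_memory
                self._cond.notify_all()        # wake blocked tasks
```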
Additionally, the compiler generates a memory pool for tensors shared between partitioned tasks, reducing CPU-GPU communication overhead during consecutive task execution.
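One way to picture such a pool is a GPU-resident map from tensor names to tensors: the producer task leaves its output on the device, and the consumer task fetches it without staging through host memory. The sketch below assumes PyTorch; the class and method names are hypothetical and the paper's pool design is likely more elaborate.

```python
import torch


class SharedTensorPool:
    """Sketch of a GPU-resident pool for tensors shared between
    consecutive partitioned tasks, avoiding CPU-GPU round trips.
    Names and structure are assumptions for illustration."""

    def __init__(self, device="cuda"):
        self._device = device
        self._pool = {}                 # tensor name -> GPU tensor

    def put(self, name, tensor):
        # Producer task stores its output directly in GPU memory.
        self._pool[name] = tensor.to(self._device, non_blocking=True)

    def get(self, name):
        # Consumer task reads the tensor without a host copy.
        return self._pool.pop(name)
```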
Compared to model-level and layer-level scheduling schemes, Quilt reduces overall latency by 25.4% and 37.7%, respectively, while preventing OOM errors.
Moreover, Quilt achieves up to 10.8% lower inference latency than the state-of-the-art Triton inference server when serving six DNN models.