
Refereed International Conference Poster

Peak-Memory-aware Partitioning and Scheduling for Multi-tenant DNN Model Inference
Jaeho Lee and Hanjun Kim
Proceedings of the 23rd ACM International Symposium on Code Generation and Optimization (CGO), Student Research Competition (SRC), March 2025.

Deep Neural Networks (DNNs) are widely deployed, and inference servers often run multiple models on a single GPU. Co-locating models this way, however, can overwhelm GPU memory and cause out-of-memory (OOM) errors. Existing scheduling approaches either preallocate memory inefficiently at the model level or incur high overhead at the layer level. This work proposes Quilt, a peak-memory-aware DNN model partitioning compiler and scheduler. The Quilt compiler partitions models into tasks based on their peak memory usage, while the scheduler orchestrates tasks across models to prevent OOM errors. A memory pool for shared tensors further reduces CPU-GPU communication overhead. Quilt reduces overall latency by 25.4% and 37.7% compared to model- and layer-level scheduling, respectively, while preventing OOM errors, and achieves up to 10.8% lower inference latency than the state-of-the-art Triton Inference Server.
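
To make the partitioning idea concrete, here is a minimal Python sketch of a greedy, peak-memory-aware partitioner. This is not the Quilt compiler: the Layer type, the additive per-task memory model, and the budget value are illustrative assumptions, whereas the actual compiler works over the model graph with a real peak-memory cost model.

```python
# Minimal sketch (illustrative, not the Quilt compiler): greedily group
# consecutive layers into tasks whose summed memory estimate stays
# under a per-task budget. A real partitioner would track tensor
# liveness instead of simply summing per-layer estimates.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    mem_mb: int  # estimated activation + workspace memory for this layer

def partition(layers, task_budget_mb):
    """Split a layer sequence into tasks under the memory budget."""
    tasks, current, used = [], [], 0
    for layer in layers:
        if current and used + layer.mem_mb > task_budget_mb:
            tasks.append(current)      # close the current task
            current, used = [], 0
        current.append(layer)
        used += layer.mem_mb
    if current:
        tasks.append(current)
    return tasks

# Hypothetical three-layer model and a 1000 MB per-task budget.
model = [Layer("conv1", 300), Layer("conv2", 900), Layer("fc", 150)]
for i, task in enumerate(partition(model, task_budget_mb=1000)):
    print(f"task {i}: {[l.name for l in task]}")
```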
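The scheduler side can be sketched in the same spirit: admit a task onto the GPU only when its declared peak memory fits in the remaining budget, so concurrently running models never exceed capacity. The class and method names below are hypothetical, not Quilt's API.

```python
# Minimal sketch (assumed design, not the paper's scheduler): reserve
# each task's peak memory before launch and block until it fits.
import threading

class PeakAwareScheduler:
    def __init__(self, gpu_budget_mb):
        self.free_mb = gpu_budget_mb
        self.cond = threading.Condition()

    def run_task(self, task_peak_mb, fn):
        with self.cond:
            while task_peak_mb > self.free_mb:
                self.cond.wait()       # wait for another task to release memory
            self.free_mb -= task_peak_mb
        try:
            fn()                       # execute the task's layers on the GPU
        finally:
            with self.cond:
                self.free_mb += task_peak_mb
                self.cond.notify_all()

# Two models share an 8000 MB budget; their 5000 MB tasks are
# serialized automatically because both peaks cannot fit at once.
sched = PeakAwareScheduler(gpu_budget_mb=8000)
threads = [
    threading.Thread(target=sched.run_task,
                     args=(5000, lambda: print("model A task"))),
    threading.Thread(target=sched.run_task,
                     args=(5000, lambda: print("model B task"))),
]
for t in threads: t.start()
for t in threads: t.join()
```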