Refereed International Conference Poster

Peak-Memory-aware Partitioning and Scheduling for Multi-tenant DNN Model Inference

Abstract:
Deep Neural Networks (DNNs) are widely deployed, and multiple inference models often share a single GPU. This sharing can overwhelm GPU memory, causing out-of-memory (OOM) errors. Existing scheduling approaches either preallocate memory inefficiently at the model level or incur high overhead at the layer level. This work proposes Quilt, a peak-memory-aware DNN model partitioning compiler and scheduler. The Quilt compiler partitions models into tasks based on their peak memory usage, while the scheduler orchestrates tasks across models to prevent OOM errors. Additionally, a memory pool for shared tensors reduces CPU-GPU communication overhead. Quilt reduces overall latency by 25.4% and 37.7% compared to model-level and layer-level scheduling, respectively, while preventing OOM errors. It also achieves up to 10.8% faster inference latency than the state-of-the-art Triton inference server.
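The abstract only outlines the idea, so as a minimal sketch of what peak-memory-aware partitioning and task admission might look like, the Python below greedily groups consecutive layers into tasks under a per-task peak-memory budget and admits tasks only while their combined peaks fit in GPU memory. All names, numbers, and the memory cost model here are hypothetical illustrations, not Quilt's actual compiler or scheduler design.

```python
import threading
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    peak_mem_mb: int  # working memory this layer needs while executing

def partition_by_peak_memory(layers, task_budget_mb):
    """Greedily group consecutive layers into tasks whose summed peak
    memory stays under a per-task budget (a deliberately crude cost
    model; a real compiler would track live-tensor lifetimes)."""
    tasks, current, current_mb = [], [], 0
    for layer in layers:
        if current and current_mb + layer.peak_mem_mb > task_budget_mb:
            tasks.append(current)
            current, current_mb = [], 0
        current.append(layer)
        current_mb += layer.peak_mem_mb
    if current:
        tasks.append(current)
    return tasks

class PeakAwareScheduler:
    """Admit a task only while its peak memory fits in the remaining
    GPU budget, so concurrently served models cannot trigger OOM."""
    def __init__(self, gpu_budget_mb):
        self.free_mb = gpu_budget_mb
        self.cond = threading.Condition()

    def run(self, task_peak_mb, run_task):
        # Block until enough GPU memory is free, then reserve it.
        with self.cond:
            self.cond.wait_for(lambda: self.free_mb >= task_peak_mb)
            self.free_mb -= task_peak_mb
        try:
            run_task()  # launch the task's kernels on the GPU
        finally:
            with self.cond:
                self.free_mb += task_peak_mb
                self.cond.notify_all()

# Example: partition a toy 4-layer model under a 1 GB per-task budget.
model = [Layer("conv1", 400), Layer("conv2", 500),
         Layer("fc1", 300), Layer("fc2", 100)]
tasks = partition_by_peak_memory(model, task_budget_mb=1000)
# -> [[conv1, conv2], [fc1, fc2]]; each task then goes to the scheduler.
```

Tasks rather than whole models as the scheduling unit is what lets this scheme sit between the two extremes the abstract criticizes: coarser than layer-level scheduling (less overhead) but finer than model-level preallocation (less wasted memory).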