Refereed International Journal Publications

Peak-Memory-aware Partitioning and Scheduling for Multi-tenant DNN Model Inference [abstract] (ScienceDirect)
As Deep Neural Networks (DNNs) are deployed in ever more applications, multiple DNN inference models increasingly run on a single GPU.
As model sizes grow, the simultaneous execution of multiple DNN models can exhaust GPU memory, leading to unexpected out-of-memory (OOM) errors.
To avoid OOM errors, existing systems schedule models at either model-level or layer-level granularity.
However, model-level scheduling uses memory inefficiently because it preallocates memory for each model's peak demand, and layer-level scheduling incurs high scheduling overhead because of its fine-grained scheduling units.
This work proposes a new peak-memory-aware DNN model partitioning compiler and scheduler, Quilt.
The Quilt compiler partitions a DNN model into multiple tasks based on their peak memory usage, and the Quilt scheduler orchestrates the tasks of multiple models without triggering OOM errors.
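The abstract does not detail the scheduling policy, but the core idea can be illustrated with a minimal sketch: admit a partitioned task only when its peak memory demand fits within the remaining GPU memory budget, blocking otherwise. The `Task` interface, the `peak_memory` attribute, and the reservation policy below are illustrative assumptions, not Quilt's actual implementation.

```python
import threading


class PeakMemoryScheduler:
    """Illustrative sketch: admit tasks by reserving their peak memory.

    Assumes each task reports its peak GPU memory demand in bytes
    (e.g., estimated by the compiler); the real Quilt policy may differ.
    """

    def __init__(self, gpu_memory_budget_bytes):
        self._free = gpu_memory_budget_bytes   # unreserved GPU memory
        self._cond = threading.Condition()

    def run(self, task):
        # Block until the task's peak demand can be reserved, so the
        # sum of reserved peaks never exceeds the GPU memory budget.
        with self._cond:
            while task.peak_memory > self._free:
                self._cond.wait()
            self._free -= task.peak_memory
        try:
            task.execute()                     # launch the task's kernels
        finally:
            with self._cond:
                self._free += task.peak_memory
                self._cond.notify_all()        # wake blocked tasks
```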
Additionally, the compiler generates a memory pool for tensors shared between partitioned tasks, reducing CPU-GPU communication overhead during consecutive task execution.
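One way to picture such a pool is a GPU-resident map from tensor names to tensors: the producer task leaves its output on the device, and the consumer task fetches it without staging through host memory. The sketch below assumes PyTorch; the class and method names are hypothetical and the paper's pool design is likely more elaborate.

```python
import torch


class SharedTensorPool:
    """Sketch of a GPU-resident pool for tensors shared between
    consecutive partitioned tasks, avoiding CPU-GPU round trips.
    Names and structure are assumptions for illustration."""

    def __init__(self, device="cuda"):
        self._device = device
        self._pool = {}                 # tensor name -> GPU tensor

    def put(self, name, tensor):
        # Producer task stores its output directly in GPU memory.
        self._pool[name] = tensor.to(self._device, non_blocking=True)

    def get(self, name):
        # Consumer task reads the tensor without a host copy.
        return self._pool.pop(name)
```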
Compared to model-level and layer-level scheduling schemes, Quilt reduces overall latency by 25.4% and 37.7%, respectively, while preventing OOM errors.
Moreover, Quilt achieves up to 10.8% lower inference latency than the state-of-the-art Triton inference server when serving six DNN models.