Adaptive Network-Aware Scheduling Extension for Mixture-of-Experts Model Deployments in Kubernetes with RoCE and GPUDirect RDMA
Kubernetes clusters that utilise high-performance networking with RoCE and GPUDirect RDMA are widely employed for distributed multi-node inference deployments of modern deep learning models with Mixture-of-Experts architectures. These deployments typically involve inter-node data transfer and synchronisation over the network due to the all-to-all communication operations generally required for inference in distributed Mixture-of-Experts models, which can incur substantial communication costs and corresponding inference performance penalties. To address these issues in clusters with dynamic workloads, this paper proposes a novel Kubernetes extension for adaptive network-aware scheduling. Concretely, we steer scheduling decisions for multi-node inference deployments of Mixture-of-Experts models towards reduced communication costs through a graph-based prioritisation of worker nodes, which uses custom networking metrics continuously captured along the end-to-end inter-node GPU-to-GPU data path. Our approach increases inference throughput by over 14% for model deployments with the DeepSeek Mixture-of-Experts architecture.
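The graph-based prioritisation described in the abstract can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the paper's implementation: it assumes inter-node GPU-to-GPU communication costs (e.g. measured RDMA latencies) are available as edge weights of a node graph, and greedily selects a set of worker nodes that minimises total pairwise communication cost. All node names and cost values are invented for the example.

```python
# Hypothetical sketch of graph-based worker-node prioritisation.
# Edge weights model inter-node GPU-to-GPU communication cost (e.g. a
# continuously measured RDMA latency metric); names and values are
# illustrative assumptions, not taken from the paper.

def select_nodes(cost, k):
    """Greedily pick k nodes minimising total pairwise communication cost.

    cost: dict mapping frozenset({a, b}) -> communication cost between
          worker nodes a and b.
    k:    number of worker nodes the deployment needs.
    """
    nodes = sorted({n for edge in cost for n in edge})
    # Seed the selection with the endpoints of the cheapest edge.
    seed = min(cost, key=cost.get)
    chosen = list(seed)
    while len(chosen) < k:
        # Add the candidate with the lowest total cost to the chosen set.
        best = min(
            (n for n in nodes if n not in chosen),
            key=lambda n: sum(cost[frozenset({n, c})] for c in chosen),
        )
        chosen.append(best)
    return sorted(chosen)

# Example: four nodes; gpu-a and gpu-b share a low-latency RoCE link.
cost = {
    frozenset({"gpu-a", "gpu-b"}): 1.0,
    frozenset({"gpu-a", "gpu-c"}): 5.0,
    frozenset({"gpu-a", "gpu-d"}): 6.0,
    frozenset({"gpu-b", "gpu-c"}): 4.0,
    frozenset({"gpu-b", "gpu-d"}): 7.0,
    frozenset({"gpu-c", "gpu-d"}): 2.0,
}
print(select_nodes(cost, 3))  # → ['gpu-a', 'gpu-b', 'gpu-c']
```

In a real Kubernetes setting, such a prioritisation would typically run inside a scheduler extension (e.g. a scoring step), with the cost graph refreshed from live networking metrics rather than a static dictionary.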
- Filich, Aleksandr
- Bajrovic, Enes
- Benkner, Siegfried
Category: Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title: The 40th International Conference on Advanced Information Networking and Applications (AINA-2026)
Divisions: Scientific Computing
Subjects: Computer Science (General); Applied Computer Science
Event Location: Wellington, New Zealand
Event Type: Conference
Event Dates: 8-10 April 2026
Series Name: Proceedings of the 40th International Conference on Advanced Information Networking and Applications (AINA-2026)
Date: 2026
