Adaptive Network-Aware Scheduling Extension for Mixture-of-Experts Model Deployments in Kubernetes with RoCE and GPUDirect RDMA
Kubernetes clusters that utilise high-performance networking with RoCE and GPUDirect RDMA are widely employed for distributed multi-node inference deployments of modern deep learning models with Mixture-of-Experts architectures. These deployments typically involve inter-node data transfer and synchronisation over the network due to the all-to-all communication operations generally required for inference in distributed Mixture-of-Experts models, which can incur substantial communication costs and corresponding inference performance penalties. To address these issues in clusters with dynamic workloads, this paper proposes a novel Kubernetes extension for adaptive network-aware scheduling. Concretely, we steer scheduling decisions for multi-node inference deployments of Mixture-of-Experts models towards reduced communication costs through a graph-based prioritisation of worker nodes, which uses custom networking metrics continuously captured along the end-to-end inter-node GPU-to-GPU data path. Our approach increases inference throughput by over 14% for model deployments with the DeepSeek Mixture-of-Experts architecture.
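The graph-based prioritisation described in the abstract can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the paper's implementation: it assumes inter-node GPU-to-GPU communication costs (e.g. measured RDMA latencies) are available as edge weights of a node graph, and greedily selects a set of worker nodes that minimises total pairwise communication cost. All node names and cost values are invented for the example.

```python
# Hypothetical sketch of graph-based worker-node prioritisation.
# Edge weights model inter-node GPU-to-GPU communication cost (e.g. a
# continuously measured RDMA latency metric); names and values are
# illustrative assumptions, not taken from the paper.

def select_nodes(cost, k):
    """Greedily pick k nodes minimising total pairwise communication cost.

    cost: dict mapping frozenset({a, b}) -> communication cost between
          worker nodes a and b.
    k:    number of worker nodes the deployment needs.
    """
    nodes = sorted({n for edge in cost for n in edge})
    # Seed the selection with the endpoints of the cheapest edge.
    seed = min(cost, key=cost.get)
    chosen = list(seed)
    while len(chosen) < k:
        # Add the candidate with the lowest total cost to the chosen set.
        best = min(
            (n for n in nodes if n not in chosen),
            key=lambda n: sum(cost[frozenset({n, c})] for c in chosen),
        )
        chosen.append(best)
    return sorted(chosen)

# Example: four nodes; gpu-a and gpu-b share a low-latency RoCE link.
cost = {
    frozenset({"gpu-a", "gpu-b"}): 1.0,
    frozenset({"gpu-a", "gpu-c"}): 5.0,
    frozenset({"gpu-a", "gpu-d"}): 6.0,
    frozenset({"gpu-b", "gpu-c"}): 4.0,
    frozenset({"gpu-b", "gpu-d"}): 7.0,
    frozenset({"gpu-c", "gpu-d"}): 2.0,
}
print(select_nodes(cost, 3))  # → ['gpu-a', 'gpu-b', 'gpu-c']
```

In a real Kubernetes setting, such a prioritisation would typically run inside a scheduler extension (e.g. a scoring step), with the cost graph refreshed from live networking metrics rather than a static dictionary.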
- Filich, Aleksandr
- Bajrovic, Enes
- Benkner, Siegfried
Category: Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title: The 40th International Conference on Advanced Information Networking and Applications (AINA-2026)
Divisions: Scientific Computing
Subjects: Computer Science (General); Applied Computer Science
Event Location: Wellington, New Zealand
Event Type: Conference
Event Dates: 8-10 April 2026
Series Name: Proceedings of the 40th International Conference on Advanced Information Networking and Applications (AINA-2026)
Date: 2026
