Adaptive Network-Aware Scheduling Extension for Mixture-of-Experts Model Deployments in Kubernetes with RoCE and GPUDirect RDMA

Abstract

Kubernetes clusters that utilise high-performance networking with RoCE and GPUDirect RDMA are widely employed for distributed multi-node inference deployments of modern deep learning models with Mixture-of-Experts architectures. Such deployments rely on inter-node data transfer and synchronisation over the network, since distributed Mixture-of-Experts inference generally requires all-to-all communication operations, which can incur substantial communication costs and corresponding inference performance penalties. To address these issues in clusters with dynamic workloads, this paper proposes a novel Kubernetes extension for adaptive network-aware scheduling. Concretely, we steer scheduling decisions for multi-node inference deployments of Mixture-of-Experts models towards reducing potential communication costs through a graph-based prioritisation of worker nodes, which uses custom networking metrics continuously captured along the end-to-end inter-node GPU-to-GPU data path. Our approach increases inference throughput by over 14% for model deployments with the DeepSeek Mixture-of-Experts architecture.
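The graph-based, network-aware node prioritisation described in the abstract could be sketched as a scoring step like the one below. This is a minimal illustration under assumed details: the `costGraph` type, the `scoreNode` function, the node names, and the normalisation scheme are all hypothetical and not taken from the paper, which does not disclose its implementation here.

```go
package main

import "fmt"

// costGraph[a][b] holds an assumed measured cost (e.g. latency in µs) of the
// GPU-to-GPU data path between nodes a and b; 0 on the diagonal.
// In the paper's setting such metrics would be captured continuously along
// the end-to-end inter-node path; here they are static example values.
type costGraph map[string]map[string]float64

// scoreNode returns a score in [0, 100] for a candidate node: the lower its
// total communication cost to the workers already placed for the deployment,
// the higher the score. maxCost is an assumed per-link normalisation bound.
func scoreNode(g costGraph, candidate string, placed []string, maxCost float64) float64 {
	if len(placed) == 0 || maxCost == 0 {
		return 100 // no placed workers yet: every candidate is equally good
	}
	total := 0.0
	for _, p := range placed {
		total += g[candidate][p]
	}
	norm := total / (maxCost * float64(len(placed)))
	if norm > 1 {
		norm = 1
	}
	return 100 * (1 - norm)
}

func main() {
	// Example: node-b shares a fast link with node-a, node-c is distant.
	g := costGraph{
		"node-a": {"node-a": 0, "node-b": 5, "node-c": 40},
		"node-b": {"node-a": 5, "node-b": 0, "node-c": 45},
		"node-c": {"node-a": 40, "node-b": 45, "node-c": 0},
	}
	placed := []string{"node-a"} // first worker already scheduled

	// node-b (close to node-a) scores higher than node-c (distant).
	fmt.Println("node-b:", scoreNode(g, "node-b", placed, 50))
	fmt.Println("node-c:", scoreNode(g, "node-c", placed, 50))
}
```

In a real deployment such a function would plug into the Kubernetes scheduling framework as a Score plugin (or a scheduler extender's prioritise endpoint), with the cost graph refreshed from live network measurements rather than hard-coded.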

Authors
  • Filich, Aleksandr
  • Bajrovic, Enes
  • Benkner, Siegfried
Shortfacts
Category
Paper in Conference Proceedings or in Workshop Proceedings (Paper)
Event Title
The 40th International Conference on Advanced Information Networking and Applications (AINA-2026)
Divisions
Scientific Computing
Subjects
Computer Science (General)
Applied Computer Science
Event Location
Wellington, New Zealand
Event Type
Conference
Event Dates
8-10 April 2026
Series Name
Proceedings of the 40th International Conference on Advanced Information Networking and Applications (AINA-2026)
Date
2026