MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning
Author(s)
Rajasekaran, Sudarsanan; Narang, Sanjoli; Zabreyko, Anton A.; Ghobadi, Manya
Terms of use
Creative Commons Attribution
Abstract
This paper argues that congestion control protocols in machine learning datacenters sit at a sweet spot between centralized and distributed flow scheduling solutions. We present MLTCP, a technique to augment today's congestion control algorithms to approximate an interleaved centralized flow schedule. At the heart of MLTCP lies a straightforward principle based on a key conceptual insight: by scaling the congestion window size (or sending rate) based on the number of bytes sent at each iteration, MLTCP flows eventually converge into a schedule that reduces network contention. We demonstrate that MLTCP follows a gradient descent-like trend, with each training (or fine-tuning) iteration taking a step toward reducing network congestion among competing jobs.
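The core mechanism described in the abstract, scaling a flow's congestion window by the bytes it sent in the most recent training iteration, can be sketched in a few lines. The Python below is an illustrative assumption, not the authors' implementation: the linear scaling form, the constants alpha and beta, and the clamping bounds are all hypothetical.

def scale_cwnd(cwnd, bytes_sent, total_bytes,
               alpha=1.0, beta=0.5, min_factor=0.5, max_factor=2.0):
    """Return the congestion window to use in the next training iteration.

    Hypothetical rule: multiply the window by a factor that grows with the
    flow's share of the bytes sent this iteration, clamped so the adjustment
    stays bounded. The exact form is an assumption for illustration.
    """
    share = bytes_sent / total_bytes if total_bytes > 0 else 0.0
    factor = alpha * share + beta
    factor = max(min_factor, min(max_factor, factor))
    return cwnd * factor

# Toy check: a flow that carried 70% of this iteration's bytes grows its
# window, while a 30% flow shrinks, biasing competing jobs toward
# interleaved communication phases.
print(scale_cwnd(cwnd=10.0, bytes_sent=7e6, total_bytes=1e7))  # 12.0
print(scale_cwnd(cwnd=10.0, bytes_sent=3e6, total_bytes=1e7))  # 8.0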
Description
HOTNETS ’24, November 18–19, 2024, Irvine, CA, USA
Date issued
2024-11-18
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
ACM | The 23rd ACM Workshop on Hot Topics in Networks
Citation
Rajasekaran, Sudarsanan, Narang, Sanjoli, Zabreyko, Anton A. and Ghobadi, Manya. 2024. "MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning."
Version: Final published version
ISBN
979-8-4007-1272-2