MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning
Author(s)
Rajasekaran, Sudarsanan; Narang, Sanjoli; Zabreyko, Anton A.; Ghobadi, Manya
Terms of use
Creative Commons Attribution
Abstract
This paper argues that congestion control protocols in machine learning datacenters sit at a sweet spot between centralized and distributed flow scheduling solutions. We present MLTCP, a technique to augment today's congestion control algorithms to approximate an interleaved centralized flow schedule. At the heart of MLTCP lies a straightforward principle based on a key conceptual insight: by scaling the congestion window size (or sending rate) based on the number of bytes sent at each iteration, MLTCP flows eventually converge into a schedule that reduces network contention. We demonstrate that MLTCP follows a gradient descent-like trend, with each training (or fine-tuning) iteration taking a step toward reducing network congestion among competing jobs.
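The core mechanism described in the abstract, scaling a flow's congestion window by the bytes it sent in the most recent training iteration, can be sketched in a few lines. The Python below is an illustrative assumption, not the authors' implementation: the linear scaling form, the constants alpha and beta, and the clamping bounds are all hypothetical.

def scale_cwnd(cwnd, bytes_sent, total_bytes,
               alpha=1.0, beta=0.5, min_factor=0.5, max_factor=2.0):
    """Return the congestion window to use in the next training iteration.

    Hypothetical rule: multiply the window by a factor that grows with the
    flow's share of the bytes sent this iteration, clamped so the adjustment
    stays bounded. The exact form is an assumption for illustration.
    """
    share = bytes_sent / total_bytes if total_bytes > 0 else 0.0
    factor = alpha * share + beta
    factor = max(min_factor, min(max_factor, factor))
    return cwnd * factor

# Toy check: a flow that carried 70% of this iteration's bytes grows its
# window, while a 30% flow shrinks, biasing competing jobs toward
# interleaved communication phases.
print(scale_cwnd(cwnd=10.0, bytes_sent=7e6, total_bytes=1e7))  # 12.0
print(scale_cwnd(cwnd=10.0, bytes_sent=3e6, total_bytes=1e7))  # 8.0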
Description
HOTNETS ’24, November 18–19, 2024, Irvine, CA, USA
Date issued
2024-11-18
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
ACM | The 23rd ACM Workshop on Hot Topics in Networks
Citation
Rajasekaran, Sudarsanan, Narang, Sanjoli, Zabreyko, Anton A. and Ghobadi, Manya. 2024. "MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning."
Version: Final published version
ISBN
979-8-4007-1272-2