dc.contributor.author | Rajasekaran, Sudarsanan | |
dc.contributor.author | Narang, Sanjoli | |
dc.contributor.author | Zabreyko, Anton A. | |
dc.contributor.author | Ghobadi, Manya | |
dc.date.accessioned | 2024-12-19T16:41:09Z | |
dc.date.available | 2024-12-19T16:41:09Z | |
dc.date.issued | 2024-11-18 | |
dc.identifier.isbn | 979-8-4007-1272-2 | |
dc.identifier.uri | https://hdl.handle.net/1721.1/157894 | |
dc.description | HOTNETS ’24, November 18–19, 2024, Irvine, CA, USA | en_US |
dc.description.abstract | This paper argues that congestion control protocols in machine learning datacenters sit at a sweet spot between centralized and distributed flow scheduling solutions. We present MLTCP, a technique to augment today's congestion control algorithms to approximate an interleaved centralized flow schedule. At the heart of MLTCP lies a straightforward principle based on a key conceptual insight: by scaling the congestion window size (or sending rate) based on the number of bytes sent at each iteration, MLTCP flows eventually converge into a schedule that reduces network contention. We demonstrate that MLTCP uses a gradient descent trend with a step taken at every training (or fine-tuning) iteration towards reducing network congestion among competing jobs. | en_US |
dc.publisher | ACM|The 23rd ACM Workshop on Hot Topics in Networks | en_US |
dc.relation.isversionof | 10.1145/3696348.3696878 | en_US |
dc.rights | Creative Commons Attribution | en_US |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en_US |
dc.source | Association for Computing Machinery | en_US |
dc.title | MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning | en_US |
dc.type | Article | en_US |
dc.identifier.citation | Rajasekaran, Sudarsanan, Narang, Sanjoli, Zabreyko, Anton A. and Ghobadi, Manya. 2024. "MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning." | |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | en_US |
dc.identifier.mitlicense | PUBLISHER_CC | |
dc.eprint.version | Final published version | en_US |
dc.type.uri | http://purl.org/eprint/type/ConferencePaper | en_US |
eprint.status | http://purl.org/eprint/status/NonPeerReviewed | en_US |
dc.date.updated | 2024-12-01T08:53:35Z | |
dc.language.rfc3066 | en | |
dc.rights.holder | The author(s) | |
dspace.date.submission | 2024-12-01T08:53:36Z | |
mit.license | PUBLISHER_CC | |
mit.metadata.status | Authority Work and Publication Information Needed | en_US |