Congestion Control for DNN training clusters

Author(s)

Narang, Sanjoli

Advisor

Ghobadi, Manya

Terms of use

In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/

Abstract

Modern DNN workloads generate network traffic that differs strikingly from conventional data-center traffic. DNN training jobs generate a periodic traffic pattern in which all subsequent flows depend on the completion of the currently running flow. Although this periodic behavior calls for a new, unconventional congestion control protocol for DNN training clusters, it also creates an unprecedented opportunity to approximate an optimal schedule for DNN jobs in a distributed manner, without requiring priority queues, centralized information, or switch hardware support. Prior work on MLTCP proposed updates to existing congestion control algorithms that enable them to minimize network congestion when DNN jobs compete for the network. In this thesis, we propose several techniques that extend the scope of this prior work to support DNN jobs with more complex communication patterns or parallelization strategies, and that further improve the performance speedup over TCP. With two straightforward ideas for updating the congestion control parameters, we extend the performance benefits of MLTCP to a wider set of periodic DNN jobs. Augmenting existing congestion control algorithms with MLTCP effectively guides a random search toward the optimal interleaved schedule for competing DNN jobs. Our contributions boost this guided search to improve performance further. We provide detailed theoretical analysis and extensive flow-level simulations to examine the convergence, performance speedup, and fairness of MLTCP with the proposed changes.

Date issued

2025-02

URI

https://hdl.handle.net/1721.1/158950

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science

Publisher

Massachusetts Institute of Technology

Collections

Graduate Theses