Congestion Control for DNN training clusters
Author(s)
Narang, Sanjoli
DownloadThesis PDF (2.399Mb)
Advisor
Ghobadi, Manya
Terms of use
Metadata
Show full item recordAbstract
The modern DNN workloads generate network traffic having striking differences with the conventional data-center traffic. DNN training jobs generate periodic traffic pattern where all subsequent flows depend on the completion of the currently running flow. Although this periodic behavior calls for a new non-conventional congestion control protocol for DNN training clusters, it also creates an unprecedented opportunity to approximate optimal schedule for DNN jobs in a distributed manner without requiring priority queues, centralized information, or switch hardware support. Prior work on MLTCP proposed updates to existing congestion control algorithms to make them capable of minimizing network congestion when DNN jobs compete for the network. In this thesis, we propose several techniques to expand the scope of prior work to support DNN jobs with more complex communication patterns or parallelization strategies, and further improve the performance speedup over TCP. With two straightforward ideas of updating the congestion control parameters, we expand the performance benefits of MLTCP to a wider set of periodic DNN jobs. Augmenting existing congestion control algorithms with MLTCP provides an effective guiding mechanism to a random search to find the optimal interleaved schedule for competing DNN jobs. Our contributions boost this guided search to improve performance further. We provide detailed theoretical analysis and extensive flow-level simulations to take a deep dive into the convergence, performance speedup, and fairness of MLTCP with the proposed changes.
Date issued
2025-02Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer SciencePublisher
Massachusetts Institute of Technology