dc.contributor.advisor: Ghobadi, Manya
dc.contributor.author: Narang, Sanjoli
dc.date.accessioned: 2025-03-27T17:00:02Z
dc.date.available: 2025-03-27T17:00:02Z
dc.date.issued: 2025-02
dc.date.submitted: 2025-03-04T17:28:59.724Z
dc.identifier.uri: https://hdl.handle.net/1721.1/158950
dc.description.abstract: Modern DNN workloads generate network traffic that differs strikingly from conventional data-center traffic. DNN training jobs generate a periodic traffic pattern in which every subsequent flow depends on the completion of the currently running flow. Although this periodic behavior calls for a new, non-conventional congestion control protocol for DNN training clusters, it also creates an unprecedented opportunity to approximate the optimal schedule for DNN jobs in a distributed manner, without requiring priority queues, centralized information, or switch hardware support. Prior work on MLTCP proposed updates to existing congestion control algorithms that enable them to minimize network congestion when DNN jobs compete for the network. In this thesis, we propose several techniques that expand the scope of that prior work to support DNN jobs with more complex communication patterns or parallelization strategies, and that further improve the speedup over TCP. With two straightforward ideas for updating the congestion control parameters, we extend the performance benefits of MLTCP to a wider set of periodic DNN jobs. Augmenting existing congestion control algorithms with MLTCP effectively guides a random search toward the optimal interleaved schedule for competing DNN jobs. Our contributions boost this guided search to improve performance further. We provide detailed theoretical analysis and extensive flow-level simulations to examine the convergence, speedup, and fairness of MLTCP with the proposed changes.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Congestion Control for DNN training clusters
dc.type: Thesis
dc.description.degree: S.M.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Science in Electrical Engineering and Computer Science

