Deepak Narayanan - Resource-Efficient Deep Learning Execution
Deep learning models have enabled state-of-the-art results across a broad range of applications; however, training these models is extremely time- and resource-intensive, in the extreme case taking weeks on clusters with thousands of expensive accelerators. In this talk, I will describe two ideas that help improve the resource efficiency of model training.
In the first half of the talk, I will discuss how pipelining can be used to accelerate distributed training. Pipeline parallelism facilitates model training with lower communication overhead than previous methods while still ensuring high compute resource utilization. It also enables the efficient training of large models that do not fit on a single worker; for example, we used pipeline parallelism at NVIDIA to efficiently scale the training of language models with a trillion parameters to 3000+ GPUs.
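To make the pipelining idea concrete, here is a minimal Python sketch of a forward-pass pipeline schedule over micro-batches. The function name, stage count, and micro-batch count are illustrative assumptions, not the system described in the talk; the point is that once enough micro-batches are in flight, the start-up and drain "bubble" becomes a small fraction of total time, which is what keeps compute utilization high.

```python
def gpipe_schedule(num_stages, num_microbatches):
    """Return, for each time step, which (stage, microbatch) forward passes run in parallel.

    Stage s works on micro-batch m at time t = s + m, so after a short fill phase
    all stages are busy, and only activations for one micro-batch cross each
    stage boundary per step (the source of pipeline parallelism's low communication volume).
    """
    steps = []
    total_steps = num_stages + num_microbatches - 1  # fill + steady state + drain
    for t in range(total_steps):
        active = []
        for s in range(num_stages):
            m = t - s
            if 0 <= m < num_microbatches:
                active.append((s, m))
        steps.append(active)
    return steps

# Example: 4 pipeline stages, 8 micro-batches (hypothetical sizes).
for t, active in enumerate(gpipe_schedule(num_stages=4, num_microbatches=8)):
    print(f"t={t}: " + ", ".join(f"stage{s}<-mb{m}" for s, m in active))
```

Printing the schedule shows 3 partially idle steps at the start and end versus 8 fully occupied steps in between; increasing the number of micro-batches shrinks that idle fraction further.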
In the second half of this talk, I will describe how a shared cluster with heterogeneous compute resources (e.g., different types of hardware accelerators) should be partitioned among its users to optimize objectives specified over one or more training jobs. Heterogeneity-aware scheduling can improve various scheduling objectives, such as average job completion time, makespan, or cloud computing resource cost, by up to 3.5x.
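As a rough illustration of what a heterogeneity-aware allocation can look like, the sketch below solves a small linear program that assigns each job a time fraction on each accelerator type so as to maximize the minimum normalized throughput across jobs. The throughput numbers, capacities, and the specific max-min objective are assumptions chosen for illustration, not the exact policies or results from the talk.

```python
import numpy as np
from scipy.optimize import linprog

# throughputs[j][k]: samples/sec of job j on accelerator type k (hypothetical values)
throughputs = np.array([
    [100.0, 40.0],   # job 0: much faster on type 0 (e.g., a newer GPU)
    [ 60.0, 55.0],   # job 1: nearly indifferent between the two types
    [ 20.0, 18.0],   # job 2
])
capacity = np.array([1.0, 2.0])        # machines available of each type (hypothetical)
baseline = throughputs.max(axis=1)     # normalize each job by its best single-type rate

J, K = throughputs.shape
n = J * K + 1                          # variables: x[j, k] flattened, plus t = min normalized rate
c = np.zeros(n)
c[-1] = -1.0                           # maximize t  <=>  minimize -t

A_ub, b_ub = [], []
# each job's normalized effective throughput must be at least t
for j in range(J):
    row = np.zeros(n)
    row[j * K:(j + 1) * K] = -throughputs[j] / baseline[j]
    row[-1] = 1.0
    A_ub.append(row); b_ub.append(0.0)
# each job receives at most one machine's worth of time overall
for j in range(J):
    row = np.zeros(n)
    row[j * K:(j + 1) * K] = 1.0
    A_ub.append(row); b_ub.append(1.0)
# each accelerator type is not over-subscribed
for k in range(K):
    row = np.zeros(n)
    row[k:J * K:K] = 1.0
    A_ub.append(row); b_ub.append(capacity[k])

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(0, 1)] * (J * K) + [(0, None)])
allocation = res.x[:-1].reshape(J, K)
print("time-fraction allocation (jobs x accelerator types):\n", allocation.round(2))
print("min normalized throughput:", round(res.x[-1], 3))
```

In this toy instance the solver gives the heterogeneity-sensitive job most of the scarce fast accelerator, while jobs that run nearly as well on the slower type are placed there, which is the intuition behind optimizing such objectives jointly rather than assigning hardware uniformly.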