CMP304-R1: AWS infrastructure for large-scale training at Facebook AI

55 min • 8 December 2019

In this session, the Facebook AI team discusses its major machine learning models and workloads and the infrastructure challenges it faced with large-scale distributed training. The team shares details of how it tested these ML workloads on AWS infrastructure and the results of this benchmarking. Then we discuss how the depth and breadth of AWS infrastructure for ML workloads in compute, networking, and storage help address large-scale ML challenges. Specifically, we dive deep into the AWS machine learning stack to show how to choose the right Amazon EC2 platform for your ML workload while leveraging 100 Gbps networking and high-performance file systems to scale efficiently from a single GPU to hundreds or thousands.

Categories

Podcasts, Technology

Appears on

Tech