This one-day, intermediate-level workshop provides students with knowledge useful for building and working with Juniper Apstra™ in an artificial intelligence (AI) data center. It gives attendees the background needed to understand the use of the back-end graphics processing unit (GPU) network described in the Juniper Validated Design (JVD) titled AI Data Center Network with Juniper Apstra, NVIDIA GPUs, and WEKA Storage. Students will learn to train AI models using the PyTorch framework on:
a single server with multiple GPUs (covering NVIDIA’s NVSwitch); and
multiple servers, each with multiple GPUs.
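The multi-GPU training scenarios above can be sketched with PyTorch's DistributedDataParallel (DDP). This is a minimal, hypothetical toy example, not material from the workshop: the model and data are invented, and it uses the CPU-only "gloo" backend so it runs anywhere; on NVIDIA GPUs the "nccl" backend would be used instead.

```python
# Minimal sketch of data-parallel training with PyTorch DDP.
# Hypothetical toy model/data; "gloo" stands in for "nccl" so this runs on CPU.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # single-host rendezvous
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(8, 1))        # wraps the model for gradient sync
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(3):                        # toy training loop
        x, y = torch.randn(4, 8), torch.randn(4, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                       # DDP all-reduces gradients here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)     # two ranks, mimicking two GPUs
```

The same script pattern scales from one server with multiple GPUs to many servers, with only the process-launch and rendezvous settings changing.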
Students will gain familiarity with network interface cards (NICs) for AI (NVIDIA ConnectX-7 and Broadcom P2200G), NVIDIA H100 GPUs, and a compute platform architecture (NVIDIA DGX H100), along with an overview of the NVIDIA-focused JVD for the AI data center. For the back-end GPU network, students will learn how the NVIDIA Collective Communication Library (NCCL), remote direct memory access (RDMA) over Converged Ethernet version 2 (RoCEv2), and a rail-optimized network design together provide an optimal communication path for NCCL's collective operations. Students will also learn how to use data center quantized congestion notification (DCQCN) and dynamic load balancing (DLB) to ensure lossless data transfer over an Ethernet-based network, and how to use Apstra to deploy the back-end GPU network and Slurm to orchestrate the training cluster.
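To make the NCCL-over-RoCEv2 idea concrete, the sketch below shows a few NCCL environment variables commonly set to steer collective traffic onto a RoCE fabric. This is an illustrative assumption, not configuration from the JVD: the device and interface names are hypothetical, and the correct GID index depends on the NIC and host configuration.

```python
# Illustrative sketch (not from the JVD): NCCL environment variables often used
# when running collectives over a RoCEv2 fabric. Names below are hypothetical.
import os

nccl_env = {
    "NCCL_IB_HCA": "mlx5_0,mlx5_1",   # hypothetical ConnectX-7 device names
    "NCCL_IB_GID_INDEX": "3",         # GID index that maps to RoCEv2 on many NICs
    "NCCL_SOCKET_IFNAME": "eth0",     # hypothetical bootstrap/control interface
    "NCCL_DEBUG": "INFO",             # log the transports and rings/trees chosen
}
os.environ.update(nccl_env)           # set before initializing the process group

for key, value in nccl_env.items():
    print(f"{key}={value}")
```

In practice these variables are exported in the job script (for example, one launched by Slurm) before the training processes start, so every rank sees the same fabric settings.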
Through lectures only, students will gain knowledge of deploying and training AI models in a data center based on the JVD titled AI Data Center Network with Juniper Apstra, NVIDIA GPUs, and WEKA Storage.
Note: Want more? A three-day version of this workshop is available, with expanded coverage of topics such as front-end and storage networks, automation with Terraform, and validated designs.