Networking Optimizations for Multi-Node Deep Learning on Kubernetes with Erez Cohen

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Today we conclude our KubeCon ‘19 series, joined by Erez Cohen, VP of CloudX & AI at Mellanox.

We caught up with Erez before his talk, “Networking Optimizations for Multi-Node Deep Learning on Kubernetes,” in which he discusses the networking problems and solutions discovered on the journey to reduce training time. In our conversation, we discuss NVIDIA’s recent acquisition of Mellanox and the fruit that relationship hopes to bear. We also discuss the evolution of technologies like RDMA, GPUDirect, and SHARP, Mellanox’s solution for improving the performance of MPI collective operations, which can be found in NVIDIA’s NCCL collective communications library. Finally, we explore how Mellanox is enabling Kubernetes and other platforms to take advantage of these technologies, and why we should care about networking in deep learning, a workload often assumed to be compute-bound.
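For context on the collective operations discussed above: the core primitive in multi-node training is allreduce, the MPI-style operation that NCCL implements and that SHARP offloads into the network switch, so that gradients are summed in-fabric rather than on the hosts. The sketch below is an illustrative single-process simulation of what allreduce computes, not the actual NCCL or SHARP API; the function name and data layout are our own.

```python
# Illustrative simulation of an allreduce collective, the operation NCCL
# performs during distributed gradient averaging. In a real cluster each
# "rank" is a separate GPU/process; here we simulate all ranks in one list.

def allreduce(per_rank_gradients):
    """Element-wise sum across ranks, then give every rank the full result."""
    n = len(per_rank_gradients[0])
    # Reduction step: sum each gradient element across all ranks.
    # (SHARP performs this step inside the switch hardware.)
    reduced = [sum(rank[i] for rank in per_rank_gradients) for i in range(n)]
    # Broadcast step: every rank receives the identical reduced vector.
    return [list(reduced) for _ in per_rank_gradients]

# 3 ranks, each holding 2 local gradient values:
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce(grads))  # every rank ends up with [9.0, 12.0]
```

Because every rank must exchange its full gradient vector each step, the operation is bandwidth- and latency-sensitive, which is why network offloads like SHARP matter even for a "compute-bound" workload.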

About Erez

Mentioned in the Interview

  • Video: Networking Optimizations for Multi-Node Deep Learning on Kubernetes – Rajat Chopra & Erez Cohen
  • InfiniBand
  • Distributed TensorFlow
  • Horovod
  • SHARP (Scalable Hierarchical Aggregation and Reduction Protocol)
  • NVIDIA Collective Communications Library (NCCL)
  • RDMA
  • IOMMU (Input–output memory management unit)

    “More On That Later” by Lee Rosevere, licensed under CC BY 4.0
