Networking Optimizations for Multi-Node Deep Learning on Kubernetes with Erez Cohen

EPISODE 345

February 5, 2020

About this Episode

Today we conclude our KubeCon '19 series joined by Erez Cohen, VP of CloudX & AI at Mellanox. We caught up with Erez before his talk, "Networking Optimizations for Multi-Node Deep Learning on Kubernetes," in which he discusses the networking problems and solutions discovered during the journey to reduce training time. In our conversation, we discuss NVIDIA's recent acquisition of Mellanox and what the two companies hope the relationship will yield. We also discuss the evolution of technologies like RDMA, GPUDirect, and SHARP, Mellanox's solution for improving the performance of MPI operations, which can be found in NVIDIA's NCCL collective communications library. Finally, we explore how Mellanox is enabling Kubernetes and other platforms to take advantage of these technologies, and why we should care about networking in deep learning even though it is a compute-bound process.