Networking Optimizations for Multi-Node Deep Learning on Kubernetes with Erez Cohen

EPISODE 345 | FEBRUARY 5, 2020

About this Episode

Today we conclude our KubeCon '19 series joined by Erez Cohen, VP of CloudX & AI at Mellanox. We caught up with Erez before his talk "Networking Optimizations for Multi-Node Deep Learning on Kubernetes," in which he discusses the networking problems and solutions his team encountered while working to reduce training time. In our conversation, we discuss NVIDIA's recent acquisition of Mellanox and what that relationship is expected to yield. We also cover the evolution of technologies like RDMA, GPUDirect, and SHARP, Mellanox's solution for improving the performance of MPI operations, which is available in NVIDIA's NCCL collective communications library. Finally, we explore how Mellanox is enabling Kubernetes and other platforms to take advantage of these technologies, and why networking matters in deep learning even though training is usually regarded as a compute-bound process.

About the Guest

Erez Cohen

Mellanox
