Many ML teams have moved beyond simply getting models to work and now focus on ensuring that those models work in a way that meets the needs of the organization. This means building processes and systems that allow models to be produced and delivered efficiently, hardened against failure, and robust to the failures that inevitably occur.
ML and MLOps practitioners have much to learn from the evolution of DevOps in this regard, particularly from the development and application of site reliability engineering (SRE) in that field, which applies engineering discipline to the challenges of operating large-scale, mission-critical software systems.
In this live podcast interview, Sam speaks with Google SRE practitioners and authors Todd Underwood and Niall Murphy about the application of SRE to MLOps.