TWIMLcon Day 2: The Secret Life of Production ML Systems

Day 2 (of 8!) of TWIMLcon: AI Platforms 2021 was a day of sharing hard-earned lessons.

(The conference started yesterday and runs through January 29, 2021. It’s not too late to join in! Use discount code GREATCONTENT for 25% off registration.)

We kicked off the day interviewing Faisal Siddiqi, Director of Engineering for Personalization Infrastructure at Netflix. Faisal has been at Netflix for a little over six years, and he shared a ton of great lessons he and his team learned while building out their internal ML platforms. This was a very dense discussion, full of hard lessons and good advice.

Some key take-aways that stood out:

Get clear on who your internal users are and what they need as your customers, then build systems that empower them to do their work with the tools they want to use.
Be both opinionated AND flexible. Use prescriptive approaches and technologies lower in the stack, where you need to maintain control, and provide more flexibility at higher levels, where people need room to innovate. Overall, the discussion on structure vs. flexibility was worth the price of admission, as it ties into the usability of the platforms we’re all building.
Understand that components in your tech stack and MLOps platform are probably going to be mixed and matched between the four possibilities: build it (DIY), borrow it (from elsewhere in the company), use an open-source element, or use a commercial solution. He commented that Netflix and his team used all four options.

There was so much more in this conversation and it was such a great start to the day. I highly recommend going back and catching the replay of this episode.

Next up, we heard from Todd Underwood, an Engineering Director at Google. Todd walked us through how models fail and how to prevent it. He probably made a lot of people feel both better and worse by starting off saying that model quality is a common production problem and that it’s both an operational (systems) problem and a human trust problem. Basically, if models fail (and they will) and you don’t know why they fail, people are less likely to trust them.

From there, he walked through his lessons learned on how to think about failure as a gift, how to look past the obvious sources of failure to the more esoteric and boring causes, and how to learn from failures as an organization over time. He then covered classes of failures and their frequent causes, and illustrated the principles he had laid out by walking through a particular story. While he had many interesting quotes in the presentation, I’ll put one of my favorites here:

“Understand YOUR system, and your system’s failures. It is worth doing. It pays dividends in better models, more resilience…if you don’t monitor model quality yet, start. If you don’t write and track post mortems yet, start. When you have an outage, make sure you learn everything you can from it. You’ve already paid for it, so you should get the value out of it.” – Todd Underwood, Engineering Director • Google

Overall, his presentation was a call to action to embrace failure and accept it as a part of building complex systems generally, and AI/ML systems in particular. This talk is worth sharing with your whole team and taking action on.

After that tough-love talk, we got to hear from Ariel Biller, an Evangelist at ClearML, and his customer Dotan Asselman, Co-Founder and CTO of theator. The talk continued the Build vs. Buy debate that Faisal touched on in the morning keynote. Spoiler alert: both of them agreed on the core insight of this talk:

“It’s not build vs. buy – it’s build AND buy and that golden ratio is use-case specific. ‘Buy’ also means open-source – remember that it may be ‘free’ but it has associated support costs.” – Ariel Biller, Evangelist • ClearML

What was really great about this talk was that they outlined the end-to-end ML platform system at theator and walked through which components were BUILT and which were BOUGHT. More importantly, they explained the thinking behind those decisions. I won’t get into the details here, as you can check out the replay until the end of the conference (or after the conference for Pro Plus and Executive Summit pass holders). I encourage you to check out the full presentation.

As we got into the thick of the day, Chip Huyen, the author of the excellent MLOps Tooling Landscape, let the audience know that ML is going real-time and that they’re probably not prepared for it. (What a day of tough love around here!) Chip’s core message was that organizations need to move beyond thinking about real-time vs. batch and instead consider “online learning.”

“Online learning is crucial for systems to adapt to rare events… Because Black Friday happens only once a year, there’s no way Amazon or other ecommerce sites can get enough historical data to learn how users are going to behave that day, so their systems need to continually learn on that day to adapt.”
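
To make the idea concrete, here is a minimal sketch of online learning: a logistic regression model that updates its weights after every single example rather than waiting for a batch retraining job. This is not code from Chip's talk; the class name, method names, and learning rate are illustrative assumptions, and a production system would add much more (feature validation, evaluation, safeguards against bad updates).

```python
import math


class OnlineLogisticRegression:
    """Hypothetical example: logistic regression trained one event at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate  # assumed value, tune for real data

    def predict_proba(self, features):
        # Probability that the label is 1 for this feature vector.
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, features, label):
        # One stochastic gradient step on the log loss for a single example.
        error = self.predict_proba(features) - label
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error
```

In a serving loop, the model would call `predict_proba` when a request arrives and `learn_one` as soon as the true outcome is observed, so behavior on an unusual day can shift the model within that same day.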

She then discussed how the two-pipeline architecture that many systems have (a batch-based pipeline for training and a streaming data pipeline for inference) is a common source of production failure, and argued that teams should be looking at ways to unify those into a common pipeline that does both. Overall, she made a compelling case for rethinking the status quo of system architectures and considering whether online learning should be a goal for your system design. As with the others above, we can’t really do it justice here: check out the replay!
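
One common way the two-pipeline setup goes wrong is that feature logic gets written twice and the copies drift apart. The sketch below shows one simple way to reduce that risk, assuming a single shared feature function that both the batch (training) path and the streaming (inference) path call. The event fields `amount` and `items` and all function names are hypothetical, chosen only for illustration; this is not the architecture presented in the talk.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, float]


def compute_features(event: Event) -> List[float]:
    """Single definition of the feature logic, shared by both paths."""
    amount = event["amount"]
    items = event["items"]
    # Average price per item, guarding against division by zero.
    avg_price = amount / items if items else 0.0
    return [amount, items, avg_price]


def batch_features(events: Iterable[Event]) -> List[List[float]]:
    # Training path: materialize features for a whole historical dataset.
    return [compute_features(event) for event in events]


def stream_features(events: Iterable[Event]) -> Iterator[List[float]]:
    # Inference path: produce features lazily, one event at a time.
    for event in events:
        yield compute_features(event)
```

Because both paths delegate to `compute_features`, the same event yields the same feature vector whether it is seen during training or at inference time.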

To continue with the theme of systems and their components, Monte Zweben, CEO of Splice Machine, shared his thoughts on feature stores: what they are, what they do, and how they’re traditionally deployed in a three-database architecture alongside scale-out operational databases and scale-out analytical data platforms. From there, he explained how Splice Machine has unified the three functions in one open-source system to help customers deliver features much faster and simplify the lifecycle of ML models. He made an argument for a database-centric approach to MLOps, and I’d encourage any of you wrestling with the complexities of feature management and delivery to go chat with Monte and his co-presenter Jack Ploshnick here at the conference this week to learn more.

The second-to-last session of the day was a fun panel discussion with several members of the Spotify ML team, who shared their thoughts on how to drive platform adoption within their broader company. A key takeaway from this discussion was that “if you build it, they will come” is not enough at a certain level of scale. Spotify created a new “engagement manager” role on its platform team to address this, with a focus on evangelizing the platform to Spotify teams and helping them be successful. There were lots of lessons in this chat for anybody building and evangelizing an internal ML platform.

Finally, we closed out the day with a workshop presented by John Posada, a Partner Solutions Architect at Dataiku. Echoing what Ariel Biller discussed earlier in the day, John described how technical debt builds up, for example as regulatory frameworks change: if you’re not agile enough, your AI systems can fall afoul of the regulatory environment, causing customer and business harm. This is the stuff that keeps your risk management team up at night (and probably your CEO as well). He suggested that the key is to use modular platforms that let you evolve individual elements without changing the whole system, while adding a layer of governance and building in guardrails to ensure fair and responsible use of AI. Dataiku’s answer to all of these requirements is their Data Science Studio (DSS) platform, and John presented a thorough walkthrough of how it can be used to create end-to-end MLOps pipelines.

Tomorrow, we will be changing things up by shifting the focus to two major workshops, plus a networking session:

David Hershey, a Solutions Architect for Tecton AI, will walk through an entire case study on how to deploy a Fraud Detection Model with their Feature Store.
More networking (with a twist!)
Kristopher Overholt, a Solution Engineer from Algorithmia, will demonstrate how to move models from training into production.

On Friday, our Executive Summit sessions will take place, and then the regular mix of technical sessions will pick up again on Tuesday, January 26th.

If this sounds interesting, it’s not too late to register! There are still six more days of sessions, including Friday’s Executive Summit. Pro Plus and Executive passes provide ongoing access to the conference recordings so that you can catch up after the event. You can check out the agenda here and the speakers here.

Thanks to all of today’s speakers: Faisal Siddiqi, Todd Underwood, Dotan Asselman, Ariel Biller, Chip Huyen, Monte Zweben, Maya Hristakeva, Lex Beattie, Maisha Lopa, Samuel Ngahane, and John Posada, for their time and contributions to a great day of learning.

