Scaling Model Training with Kubernetes at Stripe with Kelley Rivoire

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Today we’re joined by Kelley Rivoire, engineering manager working on machine learning infrastructure at Stripe.

Kelley and I caught up at a recent Strata Data conference, where she presented the talk “Scaling model training: From flexible training APIs to resource management with Kubernetes.” In our conversation, we discuss Stripe’s machine learning infrastructure journey, including how the team started with a production focus rather than with answering internal business questions. Kelley also details a few of their internal tools, including Railyard, an API built to manage model training at scale. Finally, we discuss how end users dealt with the shift to event-based, streaming models.

Join us at TWIMLcon: AI Platforms

Today we’re super excited to announce the launch of our inaugural conference: TWIMLcon: AI Platforms! TWIMLcon: AI Platforms will focus on the platforms, tools, technologies, and practices necessary to scale the delivery of machine learning and AI in the enterprise.

You already know TWIML for bringing you dynamic, practical conversations via the podcast, and we’re creating our TWIMLcon events to build on that tradition. The event will feature two full days of community-oriented discussions, live podcast interviews, and practical presentations by great presenters sharing concrete examples from their own experiences.

By creating a space where data science, machine learning, platform engineering, and MLOps practitioners and leaders can share, learn and connect, the event aspires to help seed the development of an informed, sustainable community of technologists that is well equipped to meet the current and future needs of their organizations.

Some of the topics we plan to cover include:

  • Overcoming the barriers to getting machine learning and deep learning models into production
  • How to apply MLOps and DevOps to your machine learning workflow
  • Experiences and lessons learned in delivering platform and infrastructure support for data management, experiment management, and model deployment
  • The latest approaches, platforms, and tools for accelerating and scaling the delivery of ML and DL in the enterprise
  • Platform deployment stories from leading companies like Google, Facebook, Airbnb, and traditional enterprises like Comcast and Shell
  • Organizational and cultural best practices for success

The two-day event will be held on October 1st and 2nd at the Mission Bay Conference Center in San Francisco, and I’d really love to meet you there! Early bird registration is open today at twimlcon.com, and we are offering the first 10 listeners who register the opportunity to get their ticket for 75% off using the discount code TWIMLFIRST!

Thanks to our Sponsor!


I’m really grateful to our friends over at SigOpt who stepped up to support this project in a big way. In addition to supporting our AI Platforms podcast series and next ebook, they’ve made a huge commitment to this community by signing on as the first Founding Sponsor for the event. SigOpt’s software is used by enterprise teams to standardize and scale machine learning experimentation and optimization across any combination of modeling frameworks, libraries, computing infrastructure and environment. Teams like Two Sigma, who we’ll hear from later in this podcast series, rely on SigOpt’s software to realize better modeling results much faster than previously possible. Of course, to fully grasp its potential it is best to try it yourself. This is why SigOpt is offering you an exclusive opportunity to try their product on some of your toughest modeling problems for free. To learn about and take advantage of this offer, visit twimlai.com/sigopt!

“More On That Later” by Lee Rosevere, licensed under CC BY 4.0
