Back in the fall of 2018, we conducted a series of interviews with some of the people behind the large-scale ML platforms at organizations like Facebook, Airbnb, LinkedIn, OpenAI and more. That series of interviews turned into the first volume of our AI Platforms podcast series, led to the publication of The Definitive Guide to Machine Learning Platforms ebook, and ultimately to us launching the first TWIMLcon: AI Platforms conference in San Francisco the following fall.
The first of those interviews was with Aditya Kalro, an engineering manager at Facebook. Aditya walked us through FBLearner Flow, the home-grown machine learning platform used at the company.
Sam and Aditya recently reconnected for a webcast we held on The Evolution of Machine Learning Platforms at Facebook. Check out the replay and our highlights from the discussion below.
Beyond Model Training
In the early days, FBLearner was largely focused on model training. The team had a strong appreciation for how important experimentation is to data scientists and engineers, and focused on building solid experiment management and collaboration capabilities into the tool.
Eventually, they recognized the need to create infrastructure and tooling for the entire machine learning lifecycle. This meant designing the platform to support everything from data ingestion to feature development to data preparation, all the way through to model serving.
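FBLearner Flow's internals aren't public in detail, but as a rough illustration of what supporting the full lifecycle implies, here's a hypothetical Python sketch of those stages composed into a single pipeline. The Pipeline class and toy stage bodies are our assumptions for illustration, not Facebook's actual API:

```python
# A hypothetical sketch of composing the ML lifecycle as one pipeline;
# the Pipeline class and stage names are illustrative, not FBLearner Flow's API.
from typing import Any, Callable, List

class Pipeline:
    """Chains lifecycle stages so each stage's output feeds the next."""
    def __init__(self) -> None:
        self._stages: List[Callable[[Any], Any]] = []

    def stage(self, fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
        self._stages.append(fn)  # register stages in declaration order
        return fn

    def run(self, inputs: Any) -> Any:
        for fn in self._stages:
            inputs = fn(inputs)
        return inputs

pipeline = Pipeline()

@pipeline.stage
def ingest(source: str) -> list:
    return [{"clicks": 3}, {"clicks": 7}]  # toy stand-in for a warehouse read

@pipeline.stage
def featurize(rows: list) -> list:
    return [[row["clicks"], row["clicks"] ** 2] for row in rows]

@pipeline.stage
def train(features: list) -> dict:
    return {"weights": [sum(col) / len(col) for col in zip(*features)]}

@pipeline.stage
def serve(model: dict) -> dict:
    return {"status": "deployed", **model}

print(pipeline.run("hive://toy_table"))
```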
“The big bang for the buck in AI is really data and features – we had zero tooling for it at the time and that had to change.” – Aditya Kalro
Aditya’s commentary on the importance of tooling and support for data labeling and feature management echoes what we’ve heard so far in our recent series of interviews and panel discussion on Data-Centric AI.
Investing in the Data Side of MLOps
The team invested in building out several features in support of the data side of MLOps. They added new workflows to support both manual (human) and automated (machine-only and human-in-the-loop) labeling. They also built a "feature store," which for them was a marketplace of features that anybody in the organization could discover and use.
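Here's a hypothetical sketch of what that marketplace pattern can look like in practice: producers register features once, and anyone in the organization can discover and fetch them. The FeatureStore class and feature names are illustrative assumptions, not Facebook's implementation:

```python
# A hypothetical feature-store "marketplace" sketch: register once, reuse anywhere.
from typing import Callable, Dict, List

class FeatureStore:
    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[int], float]] = {}

    def register(self, name: str, fn: Callable[[int], float]) -> None:
        self._registry[name] = fn  # publish the feature for org-wide reuse

    def discover(self, prefix: str = "") -> List[str]:
        return sorted(n for n in self._registry if n.startswith(prefix))

    def get(self, name: str, entity_id: int) -> float:
        return self._registry[name](entity_id)  # compute/look up on demand

store = FeatureStore()
store.register("user.avg_session_minutes", lambda uid: 12.5)
store.register("user.days_since_signup", lambda uid: 340.0)

print(store.discover("user."))   # every feature any team has published
print(store.get("user.days_since_signup", entity_id=42))
```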
ML Model Deployment Strategies
In addition to the data-side work, the team put a big focus on making it easy for users to find and use specialized hardware, such as GPUs, for both distributed training and production inference.
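Facebook's internal stack isn't public, but as a generic illustration of what multi-GPU distributed training looks like, here's a minimal PyTorch DistributedDataParallel sketch. The toy model and training loop are assumptions; launch with something like `torchrun --nproc_per_node=8 train.py`:

```python
# A minimal PyTorch DDP sketch (not FBLearner's API): each process drives
# one GPU, and DDP synchronizes gradients across all of them.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
    torch.cuda.set_device(local_rank)

    model = nn.Linear(128, 2).cuda(local_rank)       # toy stand-in model
    model = DDP(model, device_ids=[local_rank])      # wraps for gradient sync

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(10):                              # toy training loop
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 2, (32,), device=local_rank)
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()                              # DDP all-reduces gradients here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```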
Also on the deployment side, Aditya shared how they built a set of high-level abstractions that allowed them to express rules such as “if Model 2 performs better than Model 1, then promote Model 2 to succeed Model 1.” We could probably do a whole talk just on the mechanics of challenger models, shadow models, and model promotion and rollback procedures. If you’d be interested in learning more about this topic, hit reply below and let us know!
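For a flavor of what such a rule can look like, here's a minimal champion/challenger sketch. The ModelRecord type, the min_gain threshold, and promote_if_better are assumptions for illustration, not Facebook's abstractions:

```python
# A minimal sketch of the "promote the challenger if it beats the champion" rule.
from dataclasses import dataclass

@dataclass
class ModelRecord:
    name: str
    version: int
    metric: float  # e.g., offline AUC or an online engagement metric

def promote_if_better(champion: ModelRecord, challenger: ModelRecord,
                      min_gain: float = 0.002) -> ModelRecord:
    """Return the model that should serve traffic.

    Requiring a minimum gain guards against promoting on noise; a real
    system would also evaluate on held-out and shadow traffic first.
    """
    if challenger.metric >= champion.metric + min_gain:
        return challenger   # promote: the challenger becomes the new champion
    return champion         # keep the incumbent, which also enables easy rollback

model_1 = ModelRecord("ranker", version=1, metric=0.871)
model_2 = ModelRecord("ranker", version=2, metric=0.879)
print(promote_if_better(model_1, model_2))  # Model 2 succeeds Model 1
```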
Applying DevOps Lessons to ML Model Development
Next, they worked on treating ML model development more like the way the team was already handling traditional software development. They built systems and processes that enabled faster model build and release cycles, which in turn supported faster retraining. They also implemented more monitoring and debugging tooling.
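As one small illustration of the kind of monitoring check involved, here's a sketch that flags when a serving metric drifts from its training baseline. The threshold, toy data, and check_drift function are assumptions, not Facebook's actual tooling:

```python
# An illustrative drift monitor: alert when a live metric moves away from
# the baseline observed at training time.
import statistics

def check_drift(baseline: list, live: list, max_shift: float = 0.1) -> bool:
    """Flag when the live mean shifts more than max_shift (relative) from baseline."""
    base_mean = statistics.mean(baseline)
    shift = abs(statistics.mean(live) - base_mean) / abs(base_mean)
    return shift > max_shift

baseline_scores = [0.61, 0.58, 0.63, 0.60]   # scores seen during validation
live_scores = [0.48, 0.45, 0.50, 0.47]       # scores seen in production

if check_drift(baseline_scores, live_scores):
    print("ALERT: score distribution drifted; consider retraining or rollback")
```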
With more insight into the data and better trackability in the build systems, they were able to achieve more complete data and model lineage, that is, tracking where data came from and which models were using it in which experiments. All of this contributed to better auditability, reproducibility, and governance.
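A lineage record can be as simple as a structure that ties each model version back to the exact datasets, features, and experiment run that produced it. The sketch below is a hypothetical illustration, not Facebook's schema:

```python
# A hypothetical lineage record supporting auditability and reproducibility.
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class LineageRecord:
    model: str                                            # e.g., "ranker:v2"
    experiment_id: str                                    # run that trained it
    datasets: List[str] = field(default_factory=list)     # source data snapshots
    features: List[str] = field(default_factory=list)     # feature-store names

record = LineageRecord(
    model="ranker:v2",
    experiment_id="exp-20211104-17",
    datasets=["hive://events/2021-11-01"],
    features=["user.avg_session_minutes", "user.days_since_signup"],
)

# Answering "which models used this dataset?" becomes a simple filter:
records = [record]
print([r.model for r in records if "hive://events/2021-11-01" in r.datasets])
```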
As these processes became more systematic, they pulled in their security teams and worked on improving data security and isolation so that models only had access to the data they needed and nothing else.
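As a simple illustration of that kind of isolation, here's a hypothetical allowlist check that lets a model read only the datasets it has been granted. The ACL shape and read_dataset function are assumptions for clarity:

```python
# An illustrative access-control check: a model may read only allowlisted datasets.
ACL = {
    "ranker:v2": {"hive://events/2021-11-01"},  # datasets this model may read
}

def read_dataset(model: str, dataset: str) -> str:
    if dataset not in ACL.get(model, set()):
        raise PermissionError(f"{model} is not allowed to read {dataset}")
    return f"rows from {dataset}"  # stand-in for the actual read

print(read_dataset("ranker:v2", "hive://events/2021-11-01"))  # allowed
# read_dataset("ranker:v2", "hive://payments/raw")  # raises PermissionError
```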
Key Lessons Learned in Four Years of Platform Evolution
Early in his presentation, Aditya shared a few key design principles that guided the team in their journey:
- Reusability: making the system, data, artifacts, and workflows reusable and composable, so teams could build on prior work instead of redoing it;
- Ease of use: creating tools that were easy to use, which meant investing heavily in APIs and UIs;
- Scale: building infrastructure that allowed them to train, evaluate, and run experiments at scale.
To close out the talk, Aditya shared some of the key lessons learned through their platform evolution.
- ML platforms need to support the entire model development lifecycle.
- ML platforms must be “modular, not monolithic.”
- Standardizing data and features was critical to their success.
- Evolving your platform requires disrupting yourself. In their case, they did this by pairing infrastructure engineers with ML engineers which allowed them to continuously evolve the platform to better support their users.
Aditya also answered a number of excellent audience questions about containerization, the challenges of data standardization, supporting research vs. production teams, building your own tooling vs. leveraging open source, their approach to labeling, and more.
We want to thank Aditya and Meta for coming on the webcast, and we look forward to another update soon!