Developing your Machine Learning Platform Strategy

Definition: What is a Machine Learning Platform?

To help enterprise machine learning, data science, and AI innovators understand how model-driven enterprises are successfully scaling machine learning, we have conducted numerous interviews on the topic, both on the podcast (see Volume One and Volume Two) and at the annual TWIMLcon: AI Platforms Conference. (Note that the 2021 sessions are online and you can purchase On-Demand Access here.)

The key observation motivating and confirmed by these interviews is that organizations that have successfully scaled ML and AI share a number of characteristics. Most notably, they’ve all invested significantly in building out platform technologies to accelerate the delivery of machine learning models within their organizations. These efforts have made machine learning more accessible to more teams in the organization, ensured greater degrees of consistency and repeatability, and addressed the “last mile” of getting models into production and managing them once they are in place.

It’s our belief that effective platforms are key to delivering ML and AI at scale. These platforms support data science and ML engineering teams by allowing them to innovate more quickly and consistently. Before continuing, let’s try to define what we mean when we use the term “Machine Learning Platform.”

MACHINE LEARNING PLATFORM: A set of tools and technologies (backed by a set of practices and processes) established by an organization to support and automate various aspects of the machine learning workflow, including data acquisition, feature and experiment management, and model development, deployment, and monitoring.

Machine learning platforms come in a wide variety of forms. Until recently, they have primarily been found at large technology companies, which have developed their platforms internally out of necessity, to support increasingly significant investments in machine learning. As the importance of machine learning has become clear to a broader array of enterprises, new commercial and open source ML platform technologies have become available to reduce the barriers to adoption and make the benefits of ML models more accessible.

Creating your Machine Learning Platform strategy

So far in this blog post series here on the Solution Guide, we’ve discussed the importance of developing a model-driven orientation for your enterprise, and how that creates the need to be able to deliver models to production rapidly and consistently. We also explored some examples of the internal platforms that leading model-driven companies have developed to help them achieve this goal. From these, we identified a set of common capabilities that define the modern machine learning platform.

The next question becomes: where do we go from here? How can you use this research to develop an actionable plan for building out your own enterprise machine learning platform? Or, more ambitiously, for helping your enterprise achieve model-driven excellence?

Here, we present seven steps or considerations that will help you develop your organization’s ML platform strategy. By documenting the application of these considerations to your enterprise, you will be well down the road towards articulating an ML platform strategy for your organization.

1. Know your why

Organizations that have deployed ML platforms cite a wide variety of benefits that help justify their investment. Starting from the benefits that matter most to your organization will quickly give you a better idea of which platform capabilities and characteristics to prioritize. Here are a few of the benefits that come up most often:

Abstraction. In small data science organizations, the same data scientists are responsible for all aspects of machine learning. As the organization matures, more specialized roles evolve and it becomes important to provide for a separation of concerns. At scale, for example, we don’t want data scientists or ML engineers dealing with setting up the infrastructure upon which their models are trained and run. These tasks are best handled by ML infrastructure specialists. Platforms provide a leverage point for enforcing this separation of concerns.
Agility. Agility speaks to the speed with which an organization is able to innovate and adapt to changing needs or new trends. In a fast-changing, model-driven world, there are always more models needed than the organization has the capacity to produce. Furthermore, those models that have been developed tend to degrade over time. By accelerating their ability to get new models into production, ML platforms help enterprises become and remain more competitive.
Automation. Many aspects of the enterprise machine learning process are well-defined, repetitive, and exacting. These are ideal candidates for automation, which helps ensure greater throughput and consistency. With repetitive tasks reliably delegated to the ML platform, data scientists and developers can focus their attention and intellect on the more valuable challenges of problem and solution definition.
Consistency or repeatability. Consistency and repeatability throughout training and production are essential for any mission-critical ML use case and are key benefits offered by ML platforms. ML platforms can help ensure that the correct data is available to models both in training and in production, that the model in production is indeed the model the organization thinks is in production, and that the data seen in production is similar enough to the data that the model was trained against.
Democratization. As organizations embrace the goals of becoming more model-driven, it can be expedient to widen the circle of individuals empowered to build machine learning models. ML platforms help democratize machine learning by hiding much of the incidental complexity of model training and deployment.
Governability. In enterprise environments, governance requirements such as data security, privacy, reproducibility, explainability, fairness, and compliance will regularly come into play. Leaving it to each team or project to determine how to solve these problems is inefficient, wasteful, and goes against many of the fundamental tenets of governance. Your ML platform can provide a structure within which these teams can operate, helping to ensure greater governability.
Performance. Models that perform well on a given task are the goal of a data scientist’s experimentation and the desired outcome of the modeling process as a whole. Performance is often the difference between a model that makes it into production and one that does not. By allowing teams to more easily find and access the data and features needed for models, automating model selection and hyperparameter tuning, and monitoring in-production models for performance degradation, ML platforms can help teams build and maintain high-performing models.
Productivity. Because of the highly iterative nature of machine learning, each step in the process that can be accelerated or eliminated has a huge impact on cycle time and the ability of data scientists and developers to quickly get their models into production. ML platforms also reduce the cognitive load on data scientists and machine learning engineers, allowing them to focus on the aspects of the ML process that are most critical while delegating everything else to the platform. Furthermore, once they’ve determined the right way to tackle a particular problem, a platform allows them to automate the solution so that they no longer need to worry about it.
Scalability. ML platforms help teams scale in many ways. They help them scale training, decreasing the amount of time it requires to produce a performant model, and inference, allowing applications to make more predictions more consistently. More importantly though, ML platforms help teams scale their own output in terms of the number of models they are able to produce in a given time, and their capacity to deliver and take advantage of machine learning.
Visibility. Platforms help teams and managers gain much-needed visibility into machine learning models in development and production. They provide a centralized resource for collecting and sharing information on the data, features, experiments, and models that these teams work with so that insights about them can be more easily gained and shared.
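
To make the consistency benefit above more concrete, here is a minimal sketch of one way a platform might check whether the data seen in production is "similar enough" to the training data for a single numeric feature. It uses the Population Stability Index (PSI), which is our choice for illustration and not something the organizations we interviewed necessarily use. The function name, the bin count, and the 0.25 alert threshold (a commonly cited rule of thumb) are all assumptions.

```python
import math
from typing import Sequence


def population_stability_index(
    train: Sequence[float], prod: Sequence[float], bins: int = 10
) -> float:
    """Compare two samples of one numeric feature.

    Returns 0.0 when both samples put the same share of values in every bin;
    larger values mean the production distribution has moved further away.
    """
    lo, hi = min(train), max(train)
    # Fall back to a width of 1.0 when every training value is identical.
    width = (hi - lo) / bins or 1.0

    def bin_shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Production values outside the training range go to the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = bin_shares(train), bin_shares(prod)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Illustrative data only: production values are shifted toward the upper range.
training_sample = [i / 100 for i in range(100)]
production_sample = [0.5 + i / 200 for i in range(100)]

psi = population_stability_index(training_sample, production_sample)
if psi > 0.25:  # assumed threshold; tune it for your own features
    print(f"Feature drift suspected (PSI={psi:.2f})")
```

In practice a platform would run a check like this per feature on a schedule and feed the result into its monitoring and alerting, which is where the visibility and governability benefits come in as well.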

2. Organize for success

Almost universally, the model-driven organizations that we’ve interviewed have established dedicated “ML platform” or “ML infrastructure” teams to help make their data scientists and ML developers more productive.

A platform team is generally set up once multiple data science efforts need to be supported simultaneously. At that point, ML platform teams are established to drive efficiency and ensure that both scientists and developers have ready access to the tools and resources they need.

Airbnb, for example, established its ML infrastructure team as demand for ML models grew across a broad set of teams. The platform team’s mission is to eliminate what they call the incidental complexity of machine learning, as opposed to its inherent complexity, thus making machine learning more accessible to the company’s various developers.

While the establishment of ML platform or infrastructure teams is a trend that is just beginning within enterprises, most organizations have a recent internal example to look to. In many ways, ML infrastructure teams are to data scientists and ML developers what DevOps and developer platform teams are to traditional enterprise developers. Just as the latter have become a popular way to support software developers and ensure their productivity, the former will play the same role for their machine learning counterparts.

3. Understand your users

Once a platform team is established, its first task is to understand its users. Each of the platform’s potential users (business leaders, business analysts, subject matter experts, researchers, data scientists, machine learning engineers, software developers, and IT) will typically have different skill sets, needs, and tools.

Data scientists, to generalize, will be very comfortable with statistical tools and Jupyter notebooks but are often less comfortable with tools like version control systems that are commonplace in the software development world. Conversely, software engineers will be comfortable with a variety of crude command-line tools but might benefit from a system that offers them a selection of pre-built models to choose from.

It will be up to your organization to understand what those users need and how to get it to them without forcing data scientists to become container infrastructure experts or making data scientists out of your software developers. The key is to let everybody play to their strengths: your platform should support each of them in doing their jobs the way they want, while still serving the shared team-wide objectives.

4. Consider Build vs. Buy

Organizations seeking to establish ML platforms for their data scientists and ML engineers should carefully consider their options before deciding to build a proprietary solution.

The “build” approach, while highly customized to the needs of the organization, is expensive and requires strong engineering talent and teams to develop and maintain the platform. The “buy” option, on the other hand, often requires adapting to a given vendor’s approach but demands less time and expertise on the part of the customer.

The reality is that it’s not “build vs. buy”, it’s “Build, Borrow, Buy, and Open-Source!”. Facebook, LinkedIn, and Airbnb have each invested in dedicated engineering teams to build and maintain their own proprietary ML platforms, but they differ in how much they built themselves. Facebook’s FBLearner platform is largely built from scratch. At Airbnb, on the other hand, the company’s platform engineering team made liberal use of existing open source tools like Jupyter, Docker, Kubernetes, Airflow, and Spark in the creation of its platform. LinkedIn’s platform is arguably somewhere in the middle, based on many complex custom subsystems while taking advantage of the Hadoop and Spark ecosystems.

We believe most enterprises will ultimately compose their ML platforms from commercial, open source, or cloud-delivered software, along with custom integration and custom-coded modules as needed to address their unique needs.

5. Explore available solutions

As you may have gathered from the examples in the previous section, the machine learning tools market is growing quickly and there are many companies—startups and established vendors alike—that offer products and projects that may be of use to your organization as you build out your ML platform. So many, in fact, that the market can be quite confusing, with many of these vendors making similar and opaque claims. Understanding the market landscape will help you identify the right tools and prospective partners for your company.

Wide vs. Deep

One of the most interesting and important distinctions among the various tools available to help you build out your organization’s machine learning platform is whether the tool aims to be wide or deep:

Wide/Generalist. Wide refers to generalist tools that seek to provide end-to-end support for various aspects of the ML workflow. Wide offerings aim to give users a broad platform-in-a-box experience.
Deep/Specialist. Deep refers to specialist tools that seek to solve one problem deeply. These tools typically have robust APIs and are designed to easily fit into an organization’s existing ML workflow.

| | Wide/Generalist | Deep/Specialist |
|---|---|---|
| Pros | Easiest way to establish an ML platform; quickest path to ML platform benefits | Best-in-class functionality and/or performance |
| | Tightly integrated toolset requires little custom integration | Flexible; coexists with and is easy to integrate with existing tools, workflows, and decisions |
| | Common control plane simplifies management & governance | Greater control over the key decisions driving platform user experience |
| | One source for support | |
| Cons | Individual tools may be shallow in functionality or performance | Lacks unified management & governance |
| | All-or-nothing; may have to abandon existing investments | Must be integrated with existing workflows/systems |
| | Beholden to vendor’s roadmap and priorities for enhancements and bug fixes | Multiple vendor relationships to manage; end-user organization shoulders greater support burden |
| | Greater risk of lock-in | |

Table 1: Wide vs. deep tools / pros and cons

Wide vs. deep presents both buyers and vendors with an interesting paradox. Specialist tools by their nature tend to assume that users have enough of a pipeline or platform in place that the tool can be easily slotted in and quickly demonstrate value. Often, however, specialist tool vendors find that customers have broader gaps in their workflow that prevent them from taking advantage of the tool. As a result, these companies, often small startups, find that their sales cycles are longer than anticipated as they (or the buyer) work to fill workflow gaps with custom integration. This leads specialist vendors towards expanding their offerings to take on more of the end-to-end pipeline problem so as to more quickly demonstrate value in immature customer environments.

Generalist tools, on the other hand, face a different set of challenges. First, by the time a customer is mature enough in their data science journey that they’re ready to adopt an end-to-end platform, they’ve often already invested in building out one or more pieces of their own workflow. These customers or their users are often not too excited about needing to start from scratch with an unproven tool and throw out everything they’ve done. Second, the end-to-end problem is deceptively simple, but the existence of so many specialized vendors is an indication of the complexity, depth, and options inherent in many individual steps of the data science process. So the individual features of an end-to-end platform may not even be as capable, and are often not as tailored or differentiating, as the homegrown tools they seek to displace. Finally, as many vendors jump into the ring with shallow end-to-end offerings and aspirations to “own the data science pipeline,” end-to-end tools become increasingly commoditized. In order to differentiate, these tools will need to build depth or specialization in one or more individual areas, thus heading in the opposite direction from the specialist tools.

The result of this paradox is the increasingly confusing market in which we find ourselves today, in which everyone markets themselves as an end-to-end platform and it’s up to the customer to figure out if and where any depth exists. Hopefully this discussion makes clear that wide vs. deep should not be equated to good vs. bad or vice versa. Rather, what’s important is to realize that different technologies have different aims and that each organization will need to identify the best fit for its needs and choose accordingly.

Wide vs. deep is an important distinction, but far from the only one. Other important considerations include:

Target use case. Understanding the intended use case for a given tool is key to understanding how and where to best apply it. For example, Falkonry and Reality AI both offer modeling tools targeting manufacturing and industrial use cases. The former focuses on predictive operations applications while the latter targets embedded AI systems with a wide variety of sensor connections.
Target user profile. The needs and preferences of a business analyst vs. data scientist vs. ML engineer vs. platform engineer can vary widely. Different tools cater to different types of users in the decisions they make and features they offer. (See point 3, “Understand your users,” above.)
Target model type. Platforms differ on the types of models that they target and support. For example, some platforms target traditional ML model types while others target deep learning. While they’ve since broadened their footprint, BigML originally only supported decision trees and random forests as model types. Some platforms are even more specific, targeting models built using a specific framework. TensorFlow Extended (TFX) is an example. Others target a specific use case. For example, Allegro and Neurala are platforms designed to help users create computer vision models.
Tool history/legacy. Two similarly marketed tools can take very different approaches, often informed by the founders’ backgrounds or the company’s technology legacy.
Open vs. closed source vs. SaaS vs. cloud. ML platforms are delivered in a wide variety of formats, including open and closed source software that teams can run in their own data centers, software that is supported running in any cloud, and SaaS software, whether supplied by one of the large cloud vendors or an independent firm.

6. Learn from DevOps efforts

We’ve alluded to this point in earlier posts, but it’s worth reiterating here. Our efforts to industrialize and scale machine learning mirror in many ways the parallel evolution that has taken place in software development over the past decade and much can be learned from both the process and the result.

Modern software development practices emphasize strong problem definition (user stories), tight iterative loops (sprints), high degrees of automation (CI/CD), high levels of repeatability (Docker containers), and robust platforms that provide developer services (PaaS) and manage the underlying infrastructure (container orchestration, Kubernetes).

Of course, the analogy isn’t perfect and can be taken too far. Still, your organization likely has learned a lesson or two about providing platform services to teams of developers, and these lessons can be applied to supporting your data scientists as well.

7. Start small

It would be hard to overemphasize the evolutionary nature of ML platform development and deployment. None of the organizations we’ve profiled here, or any of the others that we’ve talked to, have deployed their machine learning platform in a single “big bang.” Rather, each organization’s ML platform evolved in a unique way based on its needs, users, skill sets, organizational structure, and existing technology investments.

In this ebook we’ve identified a broad set of capabilities to be considered for supporting and accelerating machine learning in the enterprise. Your team does not need to build an entire end-to-end system for your efforts to be successful. Rather, prioritize your efforts based on a careful analysis of your users’ specific needs and your organization’s ability to execute. Talking to your organization’s data scientists and ML engineers will likely yield a short list of pain points that could be alleviated with off-the-shelf or custom-developed tools.
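
As a rough illustration of that prioritization, the sketch below ranks a short list of pain points by combining how much users need a fix with how feasible it is for your organization to deliver one. The 1-to-5 scales, the multiplicative score, and the example pain points are all hypothetical; treat this as a starting point for a conversation rather than a prescribed method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PainPoint:
    name: str
    user_need: int    # hypothetical scale: 1 (minor annoyance) to 5 (blocks work)
    feasibility: int  # hypothetical scale: 1 (hard for us to deliver) to 5 (easy)


def prioritize(points: List[PainPoint]) -> List[PainPoint]:
    """Order pain points so high-need, high-feasibility items come first."""
    return sorted(points, key=lambda p: p.user_need * p.feasibility, reverse=True)


# Made-up entries of the kind user interviews might surface.
backlog = [
    PainPoint("manual model deployment", user_need=5, feasibility=3),
    PainPoint("no experiment tracking", user_need=4, feasibility=5),
    PainPoint("slow feature lookup", user_need=2, feasibility=2),
]

for point in prioritize(backlog):
    print(point.user_need * point.feasibility, point.name)
```

Multiplying the two scores favors items that are both painful and achievable, which fits the "start small" advice; a team that wants to weight user need more heavily could swap in a different scoring function.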

We look forward to being part of your organization’s Machine Learning journey. If you haven’t done so already, please subscribe to our mailing list so we can notify you when we publish new posts.