Once a model has been developed, it must be deployed in order to be used. Deployed models can take many forms, but typically the model is either embedded directly into application code or put behind an API. HTTP-based APIs (e.g., REST or gRPC) are increasingly common, letting developers access model predictions as microservices. While model training might require large bursts of computing hardware over the course of hours, days, or weeks, model inference—making queries against deployed models—can, over time, be even more computationally expensive than training. Each inference requires a small but not insignificant amount of computing power, and unlike the demands of training, the computational burden of inference scales with the number of inferences made and continues for as long as the model is in production.
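To make the microservice pattern concrete, here is a minimal sketch of an HTTP inference endpoint using only the Python standard library. The `predict` function is a hypothetical stand-in for a real model; a production service would load trained weights and would typically use a serving framework rather than `http.server`.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

def predict(features):
    # Hypothetical stand-in for a trained model: a fixed linear
    # scorer with a 0.5 decision threshold. A real service would
    # load serialized model weights here instead.
    weights = [0.4, 0.2, 0.4]
    score = sum(w * x for w, x in zip(weights, features))
    return {"score": score, "label": int(score > 0.5)}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON body like {"features": [1.0, 0.5, 0.2]}
        # and return the model's prediction as JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Each POST to this endpoint is one inference: a small unit of
    # compute whose total cost grows with request volume.
    HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```

Every request to this endpoint performs one inference, which is why the aggregate cost of serving scales with traffic rather than with a fixed training budget.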
Meeting the requirements of inference at scale is a classic systems engineering problem: scalability, availability, latency, and cost are typically the primary concerns. When mobile or edge deployment is required, additional constraints come into play, such as processing cycles, memory, size, weight, and power consumption.
At a certain point, we often need to turn to distributed systems to meet our scalability or performance goals. This, of course, brings its own challenges and operational considerations. How do we get our model onto multiple machines and ensure consistency over time? How do we roll out new models without taking our applications out of service? What happens when problems arise during an upgrade? How can we test new models on live traffic?
Fortunately for those deploying models, much progress has been made in addressing these questions for microservices and other distributed software systems.