Once a model has been developed, it must be deployed in order to be used. While deployed models can take many forms, typically the model is embedded directly into application code or put behind an API of some sort. Over a model's lifetime, inference is often more computationally expensive than training: unlike training, the computational burden of inference scales with the number of inferences made and continues for as long as the model is in production. Meeting the requirements of inference at scale is a classic systems engineering problem, and scalability, availability, latency, and cost are typically the primary concerns.
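As a concrete illustration of the "model behind an API" pattern, here is a minimal sketch of a prediction service using Flask. The `model.pkl` file name and the `/predict` route are illustrative assumptions, not details from the text.

```python
# Minimal sketch: a trained model exposed behind an HTTP API.
# Assumes a scikit-learn-style model serialized to "model.pkl";
# the file name and route are hypothetical.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # loaded once at startup, reused for every request

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Even in this toy form, the scaling concerns above are visible: every request pays the cost of `model.predict`, so throughput, latency, and availability all hinge on how this process is replicated and load-balanced.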
Furthermore, if mobile or edge deployment is a goal, you may need to compress the model (for example, by quantizing it) and translate ("adapt") it to run on smaller devices with different processors, lower power budgets, and less memory.
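One common compression technique is post-training quantization, which replaces 32-bit floating-point weights with 8-bit integers. The sketch below uses PyTorch's dynamic quantization as one illustration of the idea; the toy model architecture is an assumption, not something from the text.

```python
# Sketch: post-training dynamic quantization in PyTorch.
# The model here is a hypothetical stand-in for a trained network.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Convert Linear layers to int8 weights; activations are quantized
# dynamically at inference time. This shrinks the model and can speed
# up inference on CPU-bound edge hardware.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for inference.
x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```

Quantization typically costs some accuracy, so the compressed model should be re-evaluated against the original before it ships to devices.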
In addition to the fundamentals of running a high-uptime, low-latency, real-time prediction service, it is important to monitor and alert on both the accuracy and the performance of the model in production. If the model becomes less accurate or less fit for purpose ("drift"), it may need to be retrained or even decommissioned. If it becomes resource-constrained (for example, by receiving too much traffic), the model may need to be modified or the serving infrastructure scaled out. Monitoring both accuracy and performance is critical to operating the model successfully over time.
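A drift check can be as simple as comparing a live window of inputs (or predictions) against a reference sample captured at training time. The sketch below uses a two-sample Kolmogorov-Smirnov test for this; the threshold and sample sizes are illustrative assumptions, not recommendations from the text.

```python
# Sketch: a simple statistical drift check for one model input (or for
# the model's prediction scores). Threshold and window sizes are
# arbitrary illustrations.
import numpy as np
from scipy.stats import ks_2samp

def drifted(reference: np.ndarray, live_window: np.ndarray,
            p_threshold: float = 0.01) -> bool:
    """Flag drift when a KS test rejects 'same distribution'."""
    statistic, p_value = ks_2samp(reference, live_window)
    return p_value < p_threshold

reference = np.random.normal(0.0, 1.0, size=5_000)  # stand-in for training data
live = np.random.normal(0.4, 1.0, size=1_000)       # shifted production sample

if drifted(reference, live):
    print("ALERT: input distribution drift detected; consider retraining")
```

In practice a check like this would run on a schedule against logged production traffic and feed the same alerting system that watches latency and error rates, covering both halves of the monitoring problem described above.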
In short, deployment and operational tooling generally includes some or all of the following elements: