For smaller ML models, training can reasonably be done on the data scientist's desktop. As organizations begin working with more sophisticated models such as deep learning, training against larger and larger datasets, or enforcing more sophisticated data access controls, distributed training in a centralized environment becomes increasingly important. Distributed training frees up the data scientist's desktop and provides a centralized nexus for management and control.
A variety of use cases, such as computer vision and natural language processing, can benefit from distributed computing. Sometimes the datasets are too large to process on a single computer. In other cases, the models themselves are too large to fit in memory on a single GPU or CPU. Either way, data scientists and machine learning engineers find themselves trying to build, manage, and optimize a distributed computing environment that lets them split (shard) the data, split the model, or both. The problem is that distributed computing is inherently challenging, and manually managing data and model splitting can take weeks of experimentation. Additionally, splitting models poorly can result in severe GPU underutilization.
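To make the data-splitting idea concrete, here is a minimal sketch of sharding a dataset across workers. The `shard_indices` helper is hypothetical, and the round-robin scheme shown is deliberately simple; real frameworks (for example, PyTorch's `DistributedSampler`) layer shuffling, padding, and per-epoch seeding on top of the same basic idea.

```python
def shard_indices(num_examples: int, world_size: int, rank: int) -> list:
    """Return the dataset indices assigned to one worker (its rank).

    Round-robin assignment: worker r takes indices r, r + world_size, ...
    so every example lands on exactly one worker.
    """
    return list(range(rank, num_examples, world_size))

# Example: 10 examples split across 4 workers.
shards = [shard_indices(10, 4, r) for r in range(4)]

# Each worker sees a disjoint slice, and together they cover the dataset.
assert sorted(i for s in shards for i in s) == list(range(10))
```

Model splitting (placing different layers or parameter blocks on different devices) follows the same partitioning logic but is much harder to do well by hand, since a poor split leaves some devices idle while others become bottlenecks.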
Any system that assists with splitting data and models and distributing them across a fleet of machines will likely provide some or all of these features: