Distributed Model Training


For smaller ML models, training can reasonably be done on the data scientist’s desktop. As organizations adopt more sophisticated models such as deep learning, train against ever-larger datasets, or enforce stricter data access controls, distributed training in a centralized environment becomes increasingly important. Distributed training frees up the data scientist’s desktop and provides a central point of management and control.

A variety of use cases, such as computer vision and natural language processing, can benefit from distributed computing. Sometimes the dataset is too large to process on a single machine; in other cases, the model itself is too large to fit in the memory of a single GPU or CPU. Either way, data scientists and machine learning engineers find themselves building, managing, and optimizing a distributed computing environment that lets them split (shard) the data, split the model, or both. The problem is that distributed computing is inherently challenging, and manually managing data and model splitting can take weeks of experimentation. Additionally, splitting a model poorly can leave GPUs severely underutilized.
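To make the data-splitting half of this concrete, here is a minimal sketch of how a distributed training system might deal dataset samples out to workers. The function name and round-robin scheme are illustrative assumptions, not a specific product's API, though real frameworks (for example, PyTorch's `DistributedSampler`) follow the same basic idea.

```python
# Hypothetical sketch: round-robin data sharding across a worker fleet.
# Each worker receives an almost equal, non-overlapping slice of the
# dataset, identified here by sample indices.

def shard_indices(num_samples, num_workers, worker_rank):
    """Return the dataset indices assigned to one worker.

    Samples are dealt out round-robin, so shard sizes differ by at
    most one sample regardless of dataset size.
    """
    return list(range(worker_rank, num_samples, num_workers))

# Example: 10 samples split across 3 workers.
shards = [shard_indices(10, 3, rank) for rank in range(3)]
print(shards)
```

Together the shards cover every sample exactly once, which is the property a data-parallel trainer relies on when it averages gradients across workers.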

Any system that assists with splitting data and models and distributing them across a fleet of machines will likely provide some or all of these features:

  • A mechanism for orchestrating the distributed compute and storage infrastructure;
  • A mechanism for splitting (sharding) data, distributing it to the fleet, and recapturing the results;
  • A mechanism for smartly splitting up the model in a way that maximizes GPU utilization while reducing unnecessary GPU communication.
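The third feature, splitting a model to maximize GPU utilization, can be sketched as a partitioning problem: divide the model's layers into contiguous pipeline stages of roughly equal compute cost, so every device stays busy and only the stage boundaries require inter-GPU communication. The greedy heuristic below is an illustrative assumption, not any vendor's actual algorithm; real systems use more sophisticated cost models.

```python
# Hypothetical sketch of "smart" model splitting: partition layers into
# contiguous pipeline stages with roughly equal total cost. Balanced
# stages keep every GPU utilized; contiguity keeps cross-GPU
# communication limited to stage boundaries.

def partition_layers(layer_costs, num_devices):
    """Greedily split layers into `num_devices` contiguous stages."""
    target = sum(layer_costs) / num_devices  # ideal per-device cost
    stages, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        remaining_layers = len(layer_costs) - i - 1
        remaining_stages = num_devices - len(stages) - 1
        # Close the stage once it reaches the target, as long as enough
        # layers remain to populate the remaining stages.
        if (acc >= target and remaining_stages > 0
                and remaining_layers >= remaining_stages):
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages

# Example: 6 layers with uneven costs split across 3 devices.
print(partition_layers([4, 1, 1, 2, 2, 2], 3))
```

On this example each stage ends up with a total cost of 4, so no device sits idle waiting for a much slower peer.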
Vendors offering platforms in this space include:

  • SAS Visual Data Mining and Machine Learning: Solve the most complex analytical problems with a single, integrated, collaborative solution
  • RapidMiner Studio: One platform, does everything
  • An enterprise-grade platform for agile, reproducible, and scalable machine learning
  • Modern MLOps focused on speed and simplicity
  • Weights & Biases: With a few lines of code, save everything you need to debug, compare and reproduce your models
  • Spell: Spell is DLOps
  • Determined AI: Build models, not infrastructure
  • Build better models faster
  • Industry-leading AI OS for machine learning
  • Google Vertex AI: Fully managed, end-to-end platform for data science and machine learning