To build high-quality models, data scientists and machine learning engineers need access to large quantities of high-quality labeled training data. This data rarely exists in a single place or in a form directly usable by data scientists; in the vast majority of cases, the training dataset must be assembled by data scientists, data engineers, machine learning engineers, and business domain experts working together. In the real world, as opposed to academia or Kaggle competitions, creating a training dataset usually involves combining data from multiple sources. For example, a data scientist building a product recommendation model might build the training dataset by joining data from web activity logs, search history, mobile interactions, product catalogs, and transactional systems.
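A minimal sketch of that kind of multi-source join, using pandas with hypothetical table names and columns (the real sources, keys, and label definition would depend on the organization's systems):

```python
import pandas as pd

# Hypothetical source tables; in practice these would be extracted from
# web activity logs, a product catalog, and a transactional system.
web_logs = pd.DataFrame({
    "user_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "page_views": [3, 1, 5],
})
catalog = pd.DataFrame({
    "product_id": [10, 11],
    "category": ["shoes", "hats"],
})
purchases = pd.DataFrame({
    "user_id": [1, 2],
    "product_id": [10, 10],
    "purchased": [1, 1],
})

# Join the sources on their shared keys; left joins keep every
# (user, product) interaction even when no purchase occurred.
training = (
    web_logs
    .merge(catalog, on="product_id", how="left")
    .merge(purchases, on=["user_id", "product_id"], how="left")
)

# Interactions with no matching purchase become the negative class
# for the recommendation model.
training["purchased"] = training["purchased"].fillna(0).astype(int)
```

The joins themselves are simple; the hard part in practice is agreeing on keys, time windows, and label definitions across teams, which is why domain experts are part of the process.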
Organizations with modern data warehouses, data lakes, or data fabrics are at a significant advantage when it comes to scaling their ability to deliver ML projects. Without centralized data or data access, simply finding and gaining access to the data needed to build machine learning models can consume a great deal of time and effort. When data is centralized, by contrast, teams no longer need to navigate or search multiple systems to access the data they need for their projects. As a result, any effort that simplifies data access tends to be a force multiplier on data science resources. Furthermore, a data warehouse can also be a useful place to store transformed data and features, facilitating reuse across projects and teams. Note that the line between the modern data warehouse and the data lake continues to blur.
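The feature-reuse idea can be sketched as follows: transform raw data once, then persist the result as a shared table that other projects query instead of recomputing. This example uses an in-memory SQLite database as a stand-in for the warehouse, with hypothetical table and column names:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the data warehouse in this sketch.
conn = sqlite3.connect(":memory:")

# Raw transactional data as it might land in the warehouse.
pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 5.0, 15.0, 10.0],
}).to_sql("transactions", conn, index=False)

# Compute per-user features once from the raw table.
features = pd.read_sql(
    """
    SELECT user_id,
           COUNT(*)    AS txn_count,
           AVG(amount) AS avg_amount
    FROM transactions
    GROUP BY user_id
    """,
    conn,
)

# Persist the transformed features so other teams can reuse them
# directly rather than rebuilding the same aggregations.
features.to_sql("user_features", conn, index=False)
```

Storing features centrally this way is the basic motivation behind dedicated feature stores, which add versioning and consistency guarantees on top of the same pattern.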