For the last decade, as deep learning has risen to prominence, practitioners have focused on accumulating as much data as possible, labeling it, preparing it for use, and then iterating on model architectures and hyperparameters to achieve their desired performance levels. Dealing with all this data, while long recognized as tedious and time-consuming, has traditionally been considered an up-front, one-time step executed before the critical modeling phase of deep learning. Data quality issues, label noise, model drift, and other biases are all handled the same way: by gathering and labeling even more data, then proceeding with more model iterations.
The approach above has worked fairly well for tackling an organization’s most strategic problems, or at organizations for which resources are not an issue (e.g., the Facebooks and Googles of the world). It does not, however, translate well to the long tail of problems that machine learning can help solve, especially those with fewer users and less available training data.
The realization that the dominant approach to deep learning does not “scale down” to the many problems faced by today’s enterprises has given rise to a new movement in the field called “data-centric AI,” a term popularized by Andrew Ng.
Ng makes the case that we can build targeted, industry-specific AI applications by having subject matter experts curate smaller datasets of high-quality training data, then using these datasets to fine-tune any of the wide variety of off-the-shelf models developed over the last decade.
It is important to note that data-centric AI is not a technique or even a collection of techniques. It is rather a rough idea about how organizations should allocate their time and resources, and a call for greater systemization and discipline around the data aspects of the ML pipeline.
So how does an organization become more data-centric? Read on for some ideas and be sure to check out the TWIML Data-Centric AI Podcast Series for in-depth interviews with industry practitioners.
Evolving practices for efficient data labeling
The availability of labeled data is one of the key impediments for organizations looking to take advantage of machine learning. Data-centric AI is in large part a response to the significant labeling costs incurred by those attempting to use the standard model-centric approach to machine and deep learning.
There’s no playbook just yet for implementing data-centric AI, but there are some common themes we see across organizations trying to get a handle on the cost and inefficiency of data labeling. We’ve grouped them into a few high-level plays:
Whether you are labeling in-house, outsourcing to contractors, or crowd-sourcing with pieceworkers, labeling is ultimately a people-driven task, and it becomes expensive at scale, particularly if there are consensus mechanisms in place to ensure label quality, such as having multiple labelers label each data instance and cross-checking the answers. Multiply that by the exponentially growing volumes of data coming from IoT devices, cameras, lidar, radar, mobile devices, etc., and it becomes clear why teams are looking at alternatives. One promising approach put forward by data-centric AI advocates is exactly the opposite of the traditional advice: rather than just collecting and labeling more data, label less data by identifying the right data to label in the first place. Specific approaches you might hear about include:
Data Curation: Not all of your data is equally valuable when it comes to training a machine learning model. By curating your data, you can have subject matter experts (SMEs) or machines identify the best, rarest, or most instructive examples and create a smaller dataset consisting only of them. Curation attempts to weed out the useless or even harmful data, leaving only the data of highest value for model training. In a nutshell, this is about reducing your total amount of data, which has the benefit of reducing labeling costs, storage costs, and model training time and costs.
For example, consider a scenario in which your team is training a system to identify defects in semiconductor chips and you have 10,000 images of chips to work with. What if your team removed all of the redundant (duplicate) photos, irrelevant photos (camera pointing off the conveyor belt at a window), low-quality photos (washed out by afternoon sun through a window), and corrupted (noisy) photos, keeping just enough to clearly demonstrate both good-quality chips and high-quality defect examples? You would have far less labeling to do, and would lower your overall project costs.
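A minimal sketch of such a curation pass, assuming each photo comes with its raw bytes and a precomputed brightness score (both stand-ins for whatever dedup hashes and quality metrics your pipeline actually produces):

```python
import hashlib

def curate(images):
    """Filter a raw image set down to a smaller, higher-value subset.

    `images` is a list of (filename, raw_bytes, mean_brightness) tuples;
    the brightness field stands in for whatever quality metrics you compute.
    """
    seen_hashes = set()
    kept = []
    for name, data, brightness in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:          # exact duplicate: skip
            continue
        if not 30 <= brightness <= 225:    # washed-out or too-dark photo: skip
            continue
        seen_hashes.add(digest)
        kept.append(name)
    return kept

raw = [
    ("chip_001.png", b"\x01\x02", 120),   # good
    ("chip_002.png", b"\x01\x02", 120),   # duplicate of chip_001
    ("chip_003.png", b"\x09\x0a", 250),   # washed out by sunlight
    ("chip_004.png", b"\x0b\x0c", 90),    # good
]
print(curate(raw))  # ['chip_001.png', 'chip_004.png']
```

A real pipeline would use perceptual hashes to catch near-duplicates and richer quality metrics, but the shape of the filter is the same.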
Active Learning: An alternative approach is to put high-quality labels on a small subset of the data and then, by executing a number of active learning loops, have the model tell your team which other data instances it needs labeled in order to learn optimally. In this case your team is making a trade-off: in return for not having to label all of the available data, there will be a number of active learning cycles in which additional data must be labeled. Active learning adds complexity and isn’t always applicable, but when it applies, it’s a great opportunity to apply machine learning to the challenge of curating your data.
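A toy illustration of one common flavor, uncertainty sampling, using a stand-in nearest-centroid “model” over 1D points (a real loop would use your actual model, data, and labelers):

```python
import random

# Toy pool of unlabeled 1D points drawn from two classes. The class in each
# tuple plays the role of the human "oracle" we query for a label.
random.seed(0)
pool = [(random.gauss(0, 1), 0) for _ in range(50)] + \
       [(random.gauss(4, 1), 1) for _ in range(50)]

labeled = [pool.pop(0), pool.pop(-1)]   # seed with one example per class

def centroids(labeled):
    """Mean of each class among the labeled examples (our stand-in 'model')."""
    out = {}
    for x, y in labeled:
        out.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in out.items()}

def uncertainty(x, cents):
    """Smaller gap between the two centroid distances = less certain."""
    d = sorted(abs(x - c) for c in cents.values())
    return d[1] - d[0]

# Active learning loop: at each step, query the label of the pool point
# the current model is least certain about, instead of labeling everything.
for _ in range(10):
    cents = centroids(labeled)
    idx = min(range(len(pool)), key=lambda i: uncertainty(pool[i][0], cents))
    labeled.append(pool.pop(idx))       # "ask the oracle" for this label

print(f"labeled {len(labeled)} of {len(labeled) + len(pool)} points")
```

After ten rounds, only 12 of 100 points have been labeled, yet they are concentrated where the model was least sure, which is exactly where labels help most.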
Human-in-the-Loop Labeling: What if you’ve curated your data and you still have a lot of it? Let’s say you have a computer-vision based vehicle insurance application that looks at photos of damaged (insured) vehicles, and then provides claim payouts that will be enough to fix the car but not so much that your company is losing money by overpaying for repairs. In this scenario you have collected 2 million images of vehicle damage. Even if you removed 1 million of them for being useless or harmful to the training, you’re still working with 1 million images.
This could be a good use case for human-in-the-loop or hybrid labeling, where machines can label the “easy” data (like a broken window or dented door) and route “hard to label” data (e.g., a crushed front-end) to human labelers, for example based on the predicted class or the model’s confidence. Those subject matter experts (e.g., insurance underwriters or adjusters) would determine the proper payout for a particular example, which becomes a label for that instance of data in the dataset.
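The routing step can be as simple as a confidence threshold; the claim IDs, labels, and 0.90 cutoff below are illustrative assumptions you would tune against audit results:

```python
def route(predictions, threshold=0.90):
    """Split model predictions into auto-labeled and human-review queues.

    `predictions` is a list of (claim_id, predicted_label, confidence).
    """
    auto, review = [], []
    for claim_id, label, conf in predictions:
        (auto if conf >= threshold else review).append((claim_id, label))
    return auto, review

preds = [
    ("claim-101", "broken_window", 0.97),   # easy: machine labels it
    ("claim-102", "dented_door",   0.93),   # easy: machine labels it
    ("claim-103", "crushed_front", 0.55),   # hard: route to an adjuster
]
auto, review = route(preds)
print(len(auto), len(review))  # 2 1
```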
Programmatic Labeling: Another approach is to skip manual labeling or human-in-the-loop labeling altogether and use an approach based on programmatic labeling or weak supervision to accelerate the labeling process. Programmatic labeling leverages an organization’s SMEs and existing knowledge to create rules or heuristics that are used to generate labels for new data. Weak supervision is an approach to model training that recognizes that these machine-generated labels will be noisier than is typically expected of “ground truth” data. This approach, which is being pioneered by companies like Snorkel and Watchful, can radically accelerate labeling relative to manual or even human-in-the-loop approaches.
Consider the example of a Fortune 500 biotechnology company that needed to extract chronic disease data from clinical trial reports. That data was embedded in 300,000 documents. By using programmatic labeling, the company was able to save millions of dollars of labeling costs, and applied labels to the documents with 99% accuracy when compared to humans. They were able to label 300,000 documents in less than one day. They estimated the same effort would have taken up to a year if done by hand.
Instead of labeling the documents manually, the biotech’s team built labeling rules based on the expertise of their in-house subject matter experts and then applied those labels at scale to the full corpus of documents. Any errors in the labeling rules could be revised quickly and the documents relabeled in hours, not months.
Programmatic labeling may help reduce labeling time and costs, allow organizations to transfer their subject matter expertise into the labeling process more effectively, and make labeling easier to audit, manage, and govern.
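In spirit, programmatic labeling looks something like the hand-rolled sketch below, with keyword rules standing in for real SME heuristics and a simple majority vote standing in for the more sophisticated label models used by tools like Snorkel:

```python
ABSTAIN = None

# Each labeling function encodes one piece of SME knowledge as a rule.
# These keyword rules are illustrative assumptions, not a real clinical schema.
def lf_mentions_diabetes(doc):
    return "diabetes" if "diabetes" in doc.lower() else ABSTAIN

def lf_mentions_copd(doc):
    return "copd" if "copd" in doc.lower() else ABSTAIN

def lf_mentions_insulin(doc):
    # insulin therapy strongly suggests a diabetes trial
    return "diabetes" if "insulin" in doc.lower() else ABSTAIN

LFS = [lf_mentions_diabetes, lf_mentions_copd, lf_mentions_insulin]

def weak_label(doc):
    """Majority vote over the labeling functions that don't abstain."""
    votes = [lf(doc) for lf in LFS if lf(doc) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

docs = [
    "Patients received insulin therapy for type 2 diabetes.",
    "COPD exacerbations were tracked over 12 weeks.",
    "A study of migraine frequency.",
]
print([weak_label(d) for d in docs])  # ['diabetes', 'copd', None]
```

Because the rules are code, they can be versioned, audited, and rerun over the whole corpus in minutes when an SME spots an error.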
Don’t label at all
Finally, consider opportunities to build training datasets without labeling at all. People think of the revolution in natural language processing over the past few years as primarily a success of the transformer model (a model-centric view of the world), but a data-centric worldview recognizes that it was equally a success of self-supervision. Modern large language models aren’t trained using manually generated labels or a complex predictive process, but rather by simply blanking out words in text and using those known words as prediction targets during training.
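Here’s a toy version of that masking step. A real pretraining pipeline would use a subword tokenizer and feed the masked sequence to a transformer, but the key point, that the labels come from the text itself, is the same:

```python
import random

def mask_tokens(sentence, mask_rate=0.15, seed=1):
    """Blank out ~15% of tokens; the blanked words become prediction targets.

    A toy version of masked-language-model pretraining: the labels come
    from the text itself, so no human labeling is needed.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            inputs.append("[MASK]")
            targets[i] = tok          # the original word is the label
        else:
            inputs.append(tok)
    return " ".join(inputs), targets

text = "the quick brown fox jumps over the lazy dog"
masked, targets = mask_tokens(text)
print(masked)    # the sentence with some tokens replaced by [MASK]
print(targets)   # position -> original word, i.e. the training labels
```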
Another technique along the same lines is synthetic data generation, in which training data is generated artificially to represent the classes or conditions we want the model to predict.
Picture an autonomous vehicle manufacturer looking at their existing data and realizing that they have a gap in coverage. In particular, they are missing blackout and storm conditions, which rarely occur in their naturally collected data (fortunately, drivers tend to avoid them) yet which the system must still handle. They might task a synthetic data provider with producing 1,000 hours of simulated camera data representing their cars at night, in rain and fog, or driving through a city with the power out, filled with emergency vehicles.
The synthetic data generation company will then build a virtual environment with something like a gaming engine and every single object placed into that environment is already a “known known” (it is an object placed into the gaming engine environment by the system). Once the environment is built, the vendor will “drive” a simulated car with virtual cameras and lidar or radar sensors (matching the make, model, and position of the ones on the real-world vehicle) through the simulated environment. Essentially, the data—images, audio, video frames, point clouds—and the labels—bounding boxes and other metadata associated with objects like “pedestrian”, “ambulance”, “dog”, and “lane” placed in the environment—are created at the same time.
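A drastically simplified sketch of the idea: because we place every object into the scene ourselves, each bounding-box label is produced at the same moment as the “image” (the renderer here is just a placeholder for a game engine):

```python
import random

CLASSES = ["pedestrian", "ambulance", "dog", "lane"]

def render_scene(seed):
    """Place known objects into a simulated 1920x1080 scene.

    Because we put each object there ourselves, its bounding-box label
    comes for free; the filename stands in for actual rendered pixels.
    """
    rng = random.Random(seed)
    labels = []
    for _ in range(rng.randint(2, 5)):
        x, y = rng.randint(0, 1880), rng.randint(0, 1040)
        labels.append({
            "class": rng.choice(CLASSES),
            "bbox": (x, y, x + 40, y + 40),   # known exactly, no human labeling
            "conditions": {"time": "night", "weather": "rain+fog"},
        })
    image = f"frame_{seed:06d}.png"           # placeholder for rendered pixels
    return image, labels

# Generate a small labeled batch of night/storm scenes.
dataset = [render_scene(s) for s in range(3)]
for image, labels in dataset:
    print(image, [l["class"] for l in labels])
```

Scaling the same loop to 1,000 hours of driving is a rendering problem, not a labeling problem, which is exactly the appeal of the approach.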
Depending on how you count, there are at least 50 new startups in the synthetic data space. These companies can create everything from statistically “ideal” tabular datasets for banking or fintech to image, video, audio, and 3D point cloud data for training autonomous vehicles.
Of course, with self-supervised learning and synthetic data it’s not so much that there are NO labels, but rather that your team (or its ML models) does not need to create any, because they are either derived from the data itself (self-supervision) or generated alongside the data with perfect consistency (synthetic generation).
It’s an exciting time to be building AI systems. Data-centric approaches are making more use cases and applications viable across more industries. Teams are starting to recognize that data quality beats data quantity, which is leading them to curate, and thus reduce, their data more aggressively. In addition, new approaches to labeling (or not labeling) are becoming increasingly accessible: labeling only a subset of the data (active learning), collaborating with machines on labeling (human-in-the-loop labeling), creating labels programmatically, or generating new data that arrives already labeled (synthetic data generation). In short, teams are seeing the value in taking a more data-centric approach to developing machine learning systems.
To learn more about data-centric AI, be sure to check out these additional TWIML resources:
- The TWIML Data-Centric AI Podcast Series. We’re digging deep into DCAI through a series of podcast interviews with practitioners from organizations including Airbus, Google, Shopify, Toyota, Watchful, and more.
- Data-Centric AI Panel Discussion. Join us for a lively and informative discussion with data-centric AI practitioners from a variety of industries.
- TWIMLcon: AI Platforms 2022. Data-centric AI will be a hot topic at our next TWIMLcon: AI Platforms conference which focuses on MLOps and the platforms, tools, technologies, and practices necessary to enable and scale enterprise machine learning and AI.