Once data has been acquired and some initial visualization and exploration have been completed, the training data must be prepared for model development. This is known as Data Preparation and Transformation, or sometimes just “Data Wrangling.” This step involves restructuring, cleaning, enriching, validating, and potentially publishing the data. Transformations may also be required to extract labels. For example, developing a model that predicts the likelihood of churn among customers will require a label indicating which of the customers in our transactional database are examples of churn. Producing that label can, in turn, require a complex query against the data warehouse that considers factors such as the products or services that we are basing the prediction on, the number of days without a transaction, the window in which we want to make predictions, and more.
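To make the churn-label idea concrete, here is a minimal sketch in plain Python rather than a warehouse query. It assumes one possible definition of churn: a customer is labeled as churned if they have no transaction in the 90 days before a chosen cutoff date. The function name `label_churn`, the 90-day threshold, and the sample data are all illustrative assumptions; a real definition would depend on the business and the products involved.

```python
from datetime import date, timedelta


def label_churn(transactions, as_of, inactivity_days=90):
    """Return {customer_id: churned} as of the date `as_of`.

    Assumed definition (hypothetical): a customer has churned when their
    most recent transaction is more than `inactivity_days` before `as_of`.
    """
    last_seen = {}
    for customer_id, tx_date in transactions:
        if tx_date > as_of:
            # Skip anything after the cutoff so the label does not
            # leak information from the prediction window.
            continue
        if customer_id not in last_seen or tx_date > last_seen[customer_id]:
            last_seen[customer_id] = tx_date

    threshold = as_of - timedelta(days=inactivity_days)
    return {cid: last < threshold for cid, last in last_seen.items()}


# Made-up sample transactions: (customer_id, transaction_date)
transactions = [
    ("c1", date(2024, 1, 5)),
    ("c1", date(2024, 5, 20)),
    ("c2", date(2024, 1, 10)),
    ("c3", date(2024, 7, 1)),  # after the cutoff, so it is ignored
]

labels = label_churn(transactions, as_of=date(2024, 6, 30))
```

Customers whose only activity falls after the cutoff (like `c3` here) receive no label at all, which is one reasonable choice among several; another would be to exclude them earlier, when selecting the customer population.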
Data preparation can also include standardizing formats (e.g. “California” vs. “CA”); deduplication; conversions (e.g. metric to imperial or currency to currency); breaking data up into bins or buckets (e.g. ages under 20, 20-29, 30-39, 40 and over); validating data (e.g. finding outliers or outright incorrect values such as birthdates in the future or too far in the past); and backfilling (imputing missing data).
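Several of these steps can be sketched together using only the Python standard library. The field names (`id`, `state`, `birthdate`, `income`), the partial state mapping, the 120-year plausibility limit, and the choice to impute with the median are all assumptions made for illustration. The bucket edges are chosen so that every age falls into exactly one bin.

```python
from datetime import date
from statistics import median

# Hypothetical, deliberately incomplete mapping of state spellings to codes.
STATE_CODES = {"california": "CA", "ca": "CA", "new york": "NY", "ny": "NY"}


def standardize_state(raw):
    """Map known spellings to a two-letter code; otherwise return the trimmed input."""
    key = raw.strip().lower()
    return STATE_CODES.get(key, raw.strip())


def age_on(birthdate, today):
    """Age in whole years on `today`."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)


def age_bucket(age):
    """Assign an age to one of four non-overlapping buckets."""
    if age < 20:
        return "under 20"
    if age < 30:
        return "20-29"
    if age < 40:
        return "30-39"
    return "40 and over"


def is_plausible_birthdate(birthdate, today, max_age_years=120):
    """Reject birthdates in the future or implausibly far in the past."""
    return birthdate <= today and age_on(birthdate, today) <= max_age_years


def clean(records, today):
    seen_ids = set()
    cleaned = []
    for rec in records:
        if rec["id"] in seen_ids:
            continue  # deduplicate on id, keeping the first occurrence
        seen_ids.add(rec["id"])
        if not is_plausible_birthdate(rec["birthdate"], today):
            continue  # one possible policy: drop rows that fail validation
        row = dict(rec)
        row["state"] = standardize_state(rec["state"])
        row["age_bucket"] = age_bucket(age_on(rec["birthdate"], today))
        cleaned.append(row)

    # Backfill missing incomes with the median of the known values.
    known = [r["income"] for r in cleaned if r["income"] is not None]
    fill = median(known) if known else None
    for r in cleaned:
        if r["income"] is None:
            r["income"] = fill
    return cleaned


# Made-up sample records.
records = [
    {"id": 1, "state": "California", "birthdate": date(1990, 3, 15), "income": 50000},
    {"id": 1, "state": "California", "birthdate": date(1990, 3, 15), "income": 50000},
    {"id": 2, "state": " ca ", "birthdate": date(2008, 9, 1), "income": None},
    {"id": 3, "state": "NY", "birthdate": date(2090, 1, 1), "income": 70000},
    {"id": 4, "state": "New York", "birthdate": date(1975, 12, 31), "income": 90000},
]

cleaned = clean(records, today=date(2024, 6, 30))
```

In practice, rows that fail validation are often flagged for review rather than dropped, and the imputation strategy (median, mean, a model-based estimate) is itself a modeling decision worth recording.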