1.2.4 Prepare Data
The preparation phase involves finalizing the data and covers all the tasks involved in getting the data into a format that you can use to build the model.
Data preparation tasks are likely to be performed multiple times, iteratively, and not in any prescribed order. Tasks can include column (attribute) selection as well as selection of rows in a table. You may create views to join data or materialize data as required, especially if data is collected from various sources. To cleanse the data, look for invalid values, foreign key values that don't exist in other tables, and missing and outlier values. To refine the data, you can apply transformations such as aggregations, normalization, generalization, and attribute constructions needed to address the machine learning problem. For example, you can transform a DATE_OF_BIRTH column to AGE; you can insert the median income in cases where the INCOME column is null; you can filter out rows representing outliers in the data or filter columns that have too many missing or identical values.
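As a rough illustration of two of these transformations, the following Python sketch derives AGE from DATE_OF_BIRTH and fills null INCOME values with the median of the known incomes. It uses only the Python standard library and is not Oracle Machine Learning code. The sample rows, the CUST_ID column, the reference date, and the helper names age_on and prepare are assumptions made for this example.

```python
from datetime import date
from statistics import median

# Hypothetical customer rows; DATE_OF_BIRTH and INCOME mirror the column
# names used in the text, CUST_ID is assumed for illustration.
customers = [
    {"CUST_ID": 1, "DATE_OF_BIRTH": date(1980, 6, 15), "INCOME": 52000.0},
    {"CUST_ID": 2, "DATE_OF_BIRTH": date(1995, 1, 3), "INCOME": None},
    {"CUST_ID": 3, "DATE_OF_BIRTH": date(1970, 12, 31), "INCOME": 61000.0},
    {"CUST_ID": 4, "DATE_OF_BIRTH": date(1988, 3, 9), "INCOME": 48000.0},
]


def age_on(born, as_of):
    """Return the age in whole years on the as_of date."""
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    before_birthday = (as_of.month, as_of.day) < (born.month, born.day)
    return as_of.year - born.year - int(before_birthday)


def prepare(rows, as_of):
    """Replace DATE_OF_BIRTH with AGE and impute null INCOME with the median."""
    known_incomes = [r["INCOME"] for r in rows if r["INCOME"] is not None]
    median_income = median(known_incomes)
    prepared = []
    for r in rows:
        income = r["INCOME"] if r["INCOME"] is not None else median_income
        prepared.append(
            {
                "CUST_ID": r["CUST_ID"],
                "AGE": age_on(r["DATE_OF_BIRTH"], as_of),
                "INCOME": income,
            }
        )
    return prepared


prepared = prepare(customers, as_of=date(2024, 1, 1))
for row in prepared:
    print(row)
```

The median is computed from the non-null incomes only, so the imputed value is not affected by the rows it is meant to fill.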
Additionally, you can add new computed attributes in an effort to tease information closer to the surface of the data. This process is referred to as feature engineering. For example, rather than using the purchase amount, you can create a new attribute: "Number of Times Purchase Amount Exceeds $500 in a 12 month time period." How often a customer makes large purchases may be related to whether that customer responds to an offer.
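The engineered attribute described above could be computed along the following lines. This is a minimal Python sketch using only the standard library, not Oracle Machine Learning code. The purchase rows, the CUST_ID, PURCHASE_DATE, and AMOUNT column names, the function name large_purchase_counts, and the approximation of a 12 month period as 365 days ending on a chosen reference date are all assumptions made for this example.

```python
from datetime import date, timedelta

# Hypothetical purchase history; all column names are assumed for illustration.
purchases = [
    {"CUST_ID": 1, "PURCHASE_DATE": date(2023, 2, 10), "AMOUNT": 750.0},
    {"CUST_ID": 1, "PURCHASE_DATE": date(2023, 8, 1), "AMOUNT": 120.0},
    {"CUST_ID": 1, "PURCHASE_DATE": date(2023, 11, 20), "AMOUNT": 980.0},
    {"CUST_ID": 1, "PURCHASE_DATE": date(2021, 5, 5), "AMOUNT": 2000.0},
    {"CUST_ID": 2, "PURCHASE_DATE": date(2023, 6, 30), "AMOUNT": 500.0},
]


def large_purchase_counts(rows, as_of, threshold=500.0, window_days=365):
    """Count, per customer, purchases above threshold within the window.

    The window covers the window_days days ending on as_of (inclusive).
    """
    window_start = as_of - timedelta(days=window_days)
    counts = {}
    for r in rows:
        # Make sure every customer appears, even with zero large purchases.
        counts.setdefault(r["CUST_ID"], 0)
        in_window = window_start < r["PURCHASE_DATE"] <= as_of
        if in_window and r["AMOUNT"] > threshold:
            counts[r["CUST_ID"]] += 1
    return counts


counts = large_purchase_counts(purchases, as_of=date(2023, 12, 31))
print(counts)
```

The comparison is strict, so a purchase of exactly $500 does not count as exceeding the threshold, and purchases older than the window are ignored regardless of amount.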
Note:
Oracle Machine Learning supports Automatic Data Preparation (ADP), which greatly simplifies the process of data preparation.
- Clean, join, and select data
- Transform data
- Engineer new features