5.2 Create AutoML UI Experiment

To use the Oracle Machine Learning AutoML UI, you start by creating an experiment. An experiment is a unit of work that minimally specifies the data source, prediction target, and prediction type. After an experiment runs successfully, it presents you with a list of machine learning models ranked by model quality according to the selected metric. You can select any of these models for deployment or to generate a notebook. The generated notebook contains Python code that uses OML4Py and the specific settings AutoML used to produce the model.
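
The code in a generated notebook varies with each experiment. The following is only a rough, hedged sketch of OML4Py code of that kind, shown for orientation: the table name CUSTOMERS, the target column AFFINITY_CARD, and the settings are placeholders, and an established OML notebook connection is assumed.

    # Hedged sketch only; the table, column, and settings below are placeholders.
    import oml
    from oml import automl

    # Proxy object for the experiment's data source table in the current schema
    dat = oml.sync(table='CUSTOMERS')        # hypothetical table name
    train_x = dat.drop('AFFINITY_CARD')      # predictors: all columns except the target
    train_y = dat['AFFINITY_CARD']           # the Predict (target) column

    # Select and tune the best model for the chosen metric and prediction type
    ms = automl.ModelSelection(mining_function='classification',
                               score_metric='accuracy',
                               parallel=4)
    best_model = ms.select(train_x, train_y, k=1)
    print(best_model)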

To create an experiment, specify the following:
  1. In the Name field, enter a name for the experiment.

    Figure 5-5 Create an AutoML Experiment

  2. In the Comments field, enter comments, if any.
  3. In the Data Source field, click the search icon to open the Select Table dialog box. Browse and select a schema, and then select a table or view in that schema to use as the data source of your AutoML UI experiment.

    Figure 5-6 Select Table dialog

    1. In the Schema column, select a schema.

      Note:

      When you select the data source, statistics are displayed in the Features grid at the bottom of the experiment page. A busy status is indicated until the computation is complete. The target column that you select in Predict is highlighted in the Features grid.
    2. The tables available in the selected schema are listed in the Table column. Select a table and click OK.

    Note:

    To create an AutoML experiment on a table or view in another user's schema, you must have explicit privileges to access that table or view. Ask the database administrator or the owner of the schema to grant you access. For example:
    grant select on <schema>.<table> to <user>;
  4. In the Predict drop-down list, select the column from the selected table. This is the target for your prediction.
  5. In the Prediction Type field, the prediction type is automatically selected based on your data definition. However, you can override the prediction type from the drop-down list, if the data type permits. The supported prediction types are:
    • Classification: For a non-numeric data type, Classification is selected by default.
    • Regression: For a numeric data type, Regression is selected by default.
  6. The Case ID helps in data sampling and the dataset split, making the results reproducible between experiments and reducing randomness in the results. This is an optional field.
  7. In the Additional Settings section, you can define the following:


    1. Reset: Click Reset to reset the settings to the default values.
    2. Maximum Top Models: Select the maximum number of top models to create. The default is 5 models. Because tuning the models to get the top one for each algorithm requires additional time, you can reduce this number to 2 or 3. To get initial results even faster, set Maximum Top Models to 1; this tunes the model for the top recommended algorithm only.
    3. Maximum Run Duration: The maximum time for which the experiment is allowed to run. If you do not enter a time, the experiment runs for up to the default of 8 hours.
    4. Database Service Level: The database connection service level, which determines the query parallelism level. The default is Low, which provides no parallelism and sets a high runtime limit; you can create many connections with the Low service level. You can also change the database service level to Medium or High.
      • High gives the greatest parallelism but significantly limits the number of concurrent jobs.
      • Medium enables some parallelism but allows greater concurrency for job processing.

      Note:

      Changing the database service level setting on the Always Free Tier will have no effect since there is a 1 OCPU limit. However, if you increase the OCPUs allocated to your autonomous database instance, you can increase the Database Service Level to Medium or High.

      Note:

      The Database Service Level setting has no effect on AutoML container level resources.
    5. Model Metric: Select a metric to choose the winning models. The following metrics are supported by AutoML UI:
      • For Classification, the supported metrics are:
        • Balanced Accuracy
        • ROC AUC
        • F1 (with averaging options). The available options are binary, micro, macro, and weighted; a short example of how the averaging options differ follows this list.
          • Micro-averaged: All samples contribute equally to the final averaged metric.
          • Macro-averaged: All classes contribute equally to the final averaged metric.
          • Weighted-averaged: Each class's contribution to the average is weighted by its size.
        • Precision (with averaging options)
        • Recall (with averaging options)
      • For Regression, the supported metrics are:
        • R2 (default)
        • Negative mean squared error
        • Negative mean absolute error
        • Negative median absolute error
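
      The following standalone Python snippet is purely illustrative; it uses scikit-learn (not part of OML or AutoML) to show how the micro, macro, and weighted averaging options treat an imbalanced class distribution differently.

        # Illustrative only: compare F1 averaging options on a small, imbalanced example.
        from sklearn.metrics import f1_score

        y_true = [0, 0, 0, 0, 1, 1, 2]   # class 0 dominates
        y_pred = [0, 0, 1, 2, 1, 1, 2]

        print(f1_score(y_true, y_pred, average='macro'))     # classes count equally, ~0.711
        print(f1_score(y_true, y_pred, average='micro'))     # samples count equally, ~0.714
        print(f1_score(y_true, y_pred, average='weighted'))  # classes weighted by size, ~0.705
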
    6. Algorithm: The supported algorithms depend on the Prediction Type that you have selected. Select the check box next to an algorithm to include it. By default, all candidate algorithms are selected for consideration as the experiment runs. The supported algorithms for the two prediction types are:
      • For Classification, the supported algorithms are:
        • Decision Tree
        • Generalized Linear Model
        • Generalized Linear Model (Ridge Regression)
        • Neural Network
        • Random Forest
        • Support Vector Machine (Gaussian)
        • Support Vector Machine (Linear)
      • For Regression, the supported algorithms are:
        • Generalized Linear Model
        • Generalized Linear Model (Ridge Regression)
        • Neural Network
        • Support Vector Machine (Gaussian)
        • Support Vector Machine (Linear)

      Note:

      You can remove algorithms from consideration if you prefer particular algorithms or have specific requirements. For example, if model transparency is essential, excluding models such as Neural Network makes sense. Note that some algorithms are more compute intensive than others; for example, Naïve Bayes and Decision Tree are normally faster than Support Vector Machine or Neural Network.
    7. Model Name Handling: Here, you have the option to retain the original model name or to generate unique model names every time you run an experiment. By default, this option is deselected.

      • Create unique names for each run: Select this option to generate model names that are unique. Selecting this option also gives you the choice to select or deselect the option Drop models from the previous run.
        • Drop models from the previous run: Select this option to drop the models that were generated in the prior experiment runs. Deselect this option to retain the models that were generated in the prior runs. These models are available in the user schema.
  8. Expand the Features grid to view the statistics of the selected table. The supported statistics are Distinct Values, Minimum, Maximum, Mean, and Standard Deviation. The supported data sources for Features are tables, views and analytic views. The target column that you selected in Predict is highlighted here. After an experiment run is completed, the Features grid displays an additional column Importance. Feature Importance indicates the overall level of sensitivity of prediction to a particular feature.

    Figure 5-7 Features

    You can perform the following tasks:
    • Refresh: Click Refresh to fetch all columns and statistics for the selected data source.
    • View Importance: Hover your cursor over the horizontal bar under Importance to view the value of Feature Importance for the variables. The value is always depicted in the range 0 to 1, with values closer to 1 being more important.
  9. When you finish defining the experiment, the Start and Save buttons are enabled.

    Figure 5-8 Start Experiment Options

    • Click Start to run the experiment and start the AutoML UI workflow, which is displayed in the progress bar. Here, you have the option to select:
      1. Faster Results: Select this option if you want to get candidate models sooner, possibly at the expense of accuracy. This option works with a smaller set of the hyperparameter combinations, and hence yields results faster.
      2. Better Accuracy: Select this option if you want more pipeline combinations to be tried for possibly more accurate models. A pipeline is defined as an algorithm, a selected data feature set, and a set of algorithm hyperparameters.

        Note:

        This option works with the broader set of hyperparameter options recommended by the internal meta-learning model. Selecting Better Accuracy takes longer to run your experiment, but may provide models with higher accuracy.

      Once you start an experiment, the progress bar displays icons that indicate the status of each stage of the machine learning workflow in the AutoML experiment, along with the time taken to complete the experiment run. To view message details, click the respective message icons.

    • Click Save to save the experiment and run it later.
    • Click Cancel to cancel the experiment creation.

5.2.1 Supported Data Types for AutoML UI Experiments

When creating an AutoML experiment, you must specify the data source and the target of the experiment. This topic lists the data types for Python and SQL that are supported by AutoML experiments.

Table 5-1 Supported Data Types by AutoML Experiments

• Numerical
  SQL data types: NUMBER, INTEGER, FLOAT, BINARY_DOUBLE, BINARY_FLOAT, DM_NESTED_NUMERICALS, DM_NESTED_BINARY_DOUBLES, DM_NESTED_BINARY_FLOATS
  Python data types: INTEGER, FLOAT (NUMBER, BINARY_DOUBLE, BINARY_FLOAT)

• Categorical
  SQL data types: CHAR, VARCHAR2, DM_NESTED_CATEGORICALS
  Python data types: STRING (VARCHAR2, CHAR, CLOB)

• Unstructured Text
  SQL data types: CHAR, VARCHAR2, CLOB, BLOB, BFILE
  Python data types: BYTES (RAW, BLOB)
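
The following OML4Py sketch is a hedged illustration of how Python data lands in the database with the SQL types listed in Table 5-1. The table name AUTOML_DEMO_DATA is a placeholder, and an established OML connection (for example, in an OML notebook) is assumed.

    # Hedged sketch: create a database table from a pandas DataFrame with OML4Py.
    # Column types map to the SQL types in Table 5-1 (numeric columns to SQL numeric
    # types, string columns to VARCHAR2); the resulting table can then be chosen as
    # the data source of an AutoML UI experiment.
    import pandas as pd
    import oml

    df = pd.DataFrame({
        'AGE':    [34, 41, 27],                 # numerical
        'INCOME': [52000.0, 61500.5, 39000.0],  # numerical
        'REGION': ['North', 'South', 'West'],   # categorical
    })

    oml_df = oml.create(df, table='AUTOML_DEMO_DATA')  # placeholder table name
    print(oml_df.head())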