Load the Data

Perform the following steps to load the data:
  1. Create an instance of SpatialDataFrame.
    The census dataset is stored in the la_block_groups table in the database. To load it into Python, wrap the table in a DBSpatialDataset and create an instance of SpatialDataFrame from it.
    import oml
    from oraclesai import SpatialDataFrame, DBSpatialDataset
     
    block_groups = SpatialDataFrame.create(DBSpatialDataset(table='la_block_groups',
         schema='oml_user'))

    The dataset contains information about different regions (block groups) in the city of Los Angeles. Features such as MEDIAN_INCOME and HOUSE_VALUE describe each region's economic profile, while other features provide demographic information about gender, race, and age.

  2. Review the variables (shown in the following table) of the SpatialDataFrame instance and define the columns that represent the target variable, the explanatory variables, and the geometries.
    Variable               Description
    MEDIAN_INCOME          The target variable, representing the region's median income.
    MEAN_AGE               The average age in the region.
    MEAN_EDUCATION_LEVEL   A score based on the different education levels listed in the census table.
    HOUSE_VALUE            The median value of houses in the region.
    PER_WHITE              The proportion of the white population in the region.
    PER_BLACK              The proportion of the black population in the region.

    The following code selects a subset of columns from the SpatialDataFrame instance.

    X = block_groups[['MEDIAN_INCOME', 
                      'MEAN_AGE', 
                      'MEAN_EDUCATION_LEVEL', 
                      'HOUSE_VALUE', 
                      'INTERNET', 
                      'geometry']]
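    Conceptually, this step keeps only the columns needed for modeling plus the geometry column. The following stdlib-only sketch illustrates the same idea with a plain dict of columns standing in for the SpatialDataFrame; the column values are made up for illustration and are not taken from the census data.

    ```python
    # Hypothetical columnar data standing in for the SpatialDataFrame;
    # each key is a column name, each value the column's entries.
    block_groups = {
        "MEDIAN_INCOME": [52000, 61000, 47000],
        "MEAN_AGE": [34.2, 41.5, 29.8],
        "MEAN_EDUCATION_LEVEL": [2.1, 3.0, 1.8],
        "HOUSE_VALUE": [410000, 530000, 380000],
        "INTERNET": [0.82, 0.91, 0.74],
        "PER_WHITE": [0.45, 0.62, 0.30],
        "geometry": ["POLYGON(...)", "POLYGON(...)", "POLYGON(...)"],
    }

    # Keep only the modeling columns, plus the geometry.
    keep = ["MEDIAN_INCOME", "MEAN_AGE", "MEAN_EDUCATION_LEVEL",
            "HOUSE_VALUE", "INTERNET", "geometry"]
    X = {col: block_groups[col] for col in keep}

    print(sorted(X))  # the retained column names
    ```

    Columns not listed (such as PER_WHITE) are dropped from X, which mirrors what the bracketed column selection does on the SpatialDataFrame.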
  3. Define the training, validation, and test sets.
    1. Split the data into training and test sets using the spatial_train_test_split function from oraclesai.preprocessing. Assign 20% of the data for testing.
      from oraclesai.preprocessing import spatial_train_test_split
      
      X_train_valid, X_test, _, _, _, _ = spatial_train_test_split(X, y="MEDIAN_INCOME", 
          test_size=0.2, random_state=32)
    2. Split the remaining 80% of the data again to create the training and validation sets, reserving 10% of that subset for validation and the rest for training. The validation set helps evaluate the model's performance before using it with the test set.
      X_train, X_valid, _, _, _, _ = spatial_train_test_split(X_train_valid, y="MEDIAN_INCOME", 
          test_size=0.1, random_state=32)