Load the Data

Perform the following steps to load the data:
  1. Create an instance of SpatialDataFrame.
    The census dataset is stored in the la_block_groups table in the database. To load it into Python, wrap the table in a DBSpatialDataset and create an instance of SpatialDataFrame from it.
    import oml
    from oraclesai import SpatialDataFrame, DBSpatialDataset
     
    block_groups = SpatialDataFrame.create(DBSpatialDataset(table='la_block_groups',
         schema='oml_user'))

    The dataset contains information about different regions (block groups) in the city of Los Angeles. Features such as MEDIAN_INCOME and HOUSE_VALUE describe each region's economic profile, while other features provide demographic information about gender, race, and age.

  2. Review the variables (shown in the following table) of the SpatialDataFrame instance and define the columns that represent the target variable, the explanatory variables, and the geometries.
    Variable               Description
    MEDIAN_INCOME          The target variable, representing the region's median income.
    MEAN_AGE               The average age in the region.
    MEAN_EDUCATION_LEVEL   A score based on the different education levels listed in the census table.
    HOUSE_VALUE            The median value of houses in the region.
    PER_WHITE              The proportion of the white population in the region.
    PER_BLACK              The proportion of the black population in the region.

    The following code selects a subset of columns from the SpatialDataFrame instance.

    X = block_groups[['MEDIAN_INCOME', 
                      'MEAN_AGE', 
                      'MEAN_EDUCATION_LEVEL', 
                      'HOUSE_VALUE', 
                      'INTERNET', 
                      'geometry']]
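    Conceptually, this step keeps only the columns needed for modeling plus the geometry column. The following stdlib-only sketch illustrates the same idea with a plain dict of columns standing in for the SpatialDataFrame; the column values are made up for illustration and are not taken from the census data.

    ```python
    # Hypothetical columnar data standing in for the SpatialDataFrame;
    # each key is a column name, each value the column's entries.
    block_groups = {
        "MEDIAN_INCOME": [52000, 61000, 47000],
        "MEAN_AGE": [34.2, 41.5, 29.8],
        "MEAN_EDUCATION_LEVEL": [2.1, 3.0, 1.8],
        "HOUSE_VALUE": [410000, 530000, 380000],
        "INTERNET": [0.82, 0.91, 0.74],
        "PER_WHITE": [0.45, 0.62, 0.30],
        "geometry": ["POLYGON(...)", "POLYGON(...)", "POLYGON(...)"],
    }

    # Keep only the modeling columns, plus the geometry.
    keep = ["MEDIAN_INCOME", "MEAN_AGE", "MEAN_EDUCATION_LEVEL",
            "HOUSE_VALUE", "INTERNET", "geometry"]
    X = {col: block_groups[col] for col in keep}

    print(sorted(X))  # the retained column names
    ```

    Columns not listed (such as PER_WHITE) are dropped from X, which mirrors what the bracketed column selection does on the SpatialDataFrame.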
  3. Define the training, validation, and test sets.
    1. Split the data into training and test sets using the spatial_train_test_split function from oraclesai.preprocessing. Assign 20% of the data for testing.
      from oraclesai.preprocessing import spatial_train_test_split
      
      X_train_valid, X_test, _, _, _, _ = spatial_train_test_split(X, y="MEDIAN_INCOME", 
          test_size=0.2, random_state=32)
    2. Split the remaining 80% of the data again to create the training and validation sets, reserving 10% of that subset for validation and the rest for training. The validation set helps evaluate the model's performance before using it with the test set.
      X_train, X_valid, _, _, _, _ = spatial_train_test_split(X_train_valid, y="MEDIAN_INCOME", 
          test_size=0.1, random_state=32)