Explore the Data

Understand the data by visualizing the first five observations of the training set using the head method.

from oraclesai import enable_geodataframes
enable_geodataframes(z)

z.show(X_train.head())

The output is as shown:

Figure: regression_explore_data_head_method.png (first five observations of the training set)

Define spatial weights, which establish the relationship between neighboring locations, to understand how each variable is influenced by nearby observations.

Use the K-Nearest Neighbor approach, in which the K observations closest to each observation are considered its neighbors. The following code sets K to 10.

from oraclesai.weights import KNNWeightsDefinition

weights_definition = KNNWeightsDefinition(k=10)
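To make the neighbor relationship concrete, the following is a minimal sketch of a K-nearest-neighbor search in plain Python. It is not how oraclesai builds its weights internally. The function name knn_neighbors and the toy coordinates are hypothetical and serve only as an illustration.

```python
import math

def knn_neighbors(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point (the point itself is excluded)
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Five hypothetical locations along a line
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (6.0, 0.0), (6.5, 0.0)]
neighbors = knn_neighbors(points, k=2)
print(neighbors)
```

Note that the relationship is not necessarily symmetric: in this toy data, location 2 is a neighbor of location 3, but location 3 is not among the two nearest neighbors of location 2.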

Calculate the spatial lag to study the interaction with the neighboring locations.

The spatial lag of an observation represents the average value of a certain feature among its neighbors. For example, the average house value across neighboring locations.

The following code calculates the spatial lag for all the variables in the training set, except the geometries.

from oraclesai.preprocessing import SpatialLagTransformer 

X_spatial_lag = SpatialLagTransformer(weights_definition).fit_transform(X_train)
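As a rough illustration of what a spatial lag is, the sketch below averages a feature over each observation's neighbors in plain Python. It assumes equal weights for all neighbors, which matches the "average value" definition above. The helper spatial_lag, the house values, and the neighbor lists are hypothetical and are not part of oraclesai.

```python
def spatial_lag(values, neighbors):
    """Average of each observation's neighbor values (equal weight per neighbor)."""
    return [sum(values[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]

# Hypothetical house values for five locations
house_value = [100.0, 120.0, 110.0, 300.0, 320.0]
# Hypothetical neighbor lists: neighbors[i] holds the indices of i's two nearest locations
neighbors = [[1, 2], [0, 2], [1, 0], [4, 2], [3, 2]]

house_value_lag = spatial_lag(house_value, neighbors)
print(house_value_lag)
```

Each entry of house_value_lag plays the role of a HOUSE_VALUE_LAG value: it describes the neighborhood of a location rather than the location itself.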

According to Tobler's first law of geography, everything is related to everything else, but near things are more related than distant things. To understand the relation between features in a specific location, use the correlation between a feature and its spatial lag. For example, a strong positive correlation between the median income and the average income from neighboring locations could indicate an influence on the median income from its neighbors.
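The idea of correlating a feature with its own spatial lag can be sketched with a hand-written Pearson correlation. The income figures and lag values below are made up for illustration, and the pearson helper is hypothetical rather than an oraclesai function.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Hypothetical incomes and the average income of each location's neighbors
income = [100.0, 120.0, 110.0, 300.0, 320.0]
income_lag = [115.0, 105.0, 110.0, 215.0, 205.0]

r = pearson(income, income_lag)
print(f"correlation with spatial lag = {r:.3f}")
```

A value close to 1, as in this toy data, is the kind of pattern that suggests a location's income moves together with the income of its neighbors.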

The following code displays the correlation matrix of the spatial lag variables and the target variable, where the spatial lag variables have the suffix _LAG.

import numpy as np
import pandas as pd 

# Append the target variable to the spatial lag variables 
# (the target is reshaped into a single column so it can be joined along axis 1)
X_target_spatial_lag = np.append(X_train["MEDIAN_INCOME"].values.reshape(-1, 1), X_spatial_lag, axis=1)

# Create a Pandas' DataFrame
columns = ["MEDIAN_INCOME", "MEDIAN_INCOME_LAG", "MEAN_AGE_LAG", "MEAN_EDUCATION_LEVEL_LAG", "HOUSE_VALUE_LAG", "INTERNET_LAG"] 
X_target_spatial_lag_df = pd.DataFrame(data=X_target_spatial_lag, columns=columns)

z.show(X_target_spatial_lag_df.corr())

The output is as shown:

Figure: correlation_matrix.png (correlation matrix of the target variable and the spatial lag variables)

Measure the influence of neighbors.

There is a strong positive correlation between the target variable (MEDIAN_INCOME) and its spatial lag (MEDIAN_INCOME_LAG). This indicates that locations with similar income tend to be located near each other, which is an indicator of spatial dependence.

To confirm the presence of spatial dependence, calculate the Moran’s I statistic, which measures spatial autocorrelation.

A positive and significant value indicates spatial clustering, where regions with similar values (high or low) tend to be located near each other, reflecting the effect of spatial dependence.
A negative and significant value indicates a checkerboard pattern, where neighboring regions tend to have dissimilar values, reflecting the effect of spatial heterogeneity.

from oraclesai.analysis import MoranITest 
from oraclesai.weights import SpatialWeights 
# Create spatial weights from definition 
spatial_weights = SpatialWeights.create(X_train["geometry"].values, weights_definition) 

# Run the Moran's I test 
moran_test = MoranITest.create(X_train, spatial_weights, column_name="MEDIAN_INCOME") 

# Print the Moran's I and the p-value 
print("Moran's I = ", moran_test.i) 
print("p_value = ", moran_test.p_value)

The Moran’s I statistic is positive and significant, confirming the presence of spatial dependence in the target variable.

Moran's I =  0.5744827266749303
p_value =  0.001
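For intuition about the sign of the statistic, the sketch below computes Moran's I by hand on two toy 4 x 4 grids. It deliberately swaps the K-nearest-neighbor weights used in this example for simple binary weights between cells that share an edge, because that makes the clustered and checkerboard cases easy to verify. The helpers morans_i and rook_neighbors are hypothetical and are not the oraclesai implementation. No significance test is included.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights: w_ij = 1 when j is listed as a neighbor of i."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(len(nbrs) for nbrs in neighbors)  # sum of all weights
    cross = sum(z[i] * z[j] for i, nbrs in enumerate(neighbors) for j in nbrs)
    return (n / s0) * cross / sum(d * d for d in z)

def rook_neighbors(rows, cols):
    """Edge-sharing neighbors on a rows x cols grid, cells indexed row by row."""
    result = []
    for r in range(rows):
        for c in range(cols):
            nbrs = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    nbrs.append(rr * cols + cc)
            result.append(nbrs)
    return result

grid = rook_neighbors(4, 4)
# Left half high, right half low: similar values sit next to each other
clustered = [1.0 if c < 2 else 0.0 for r in range(4) for c in range(4)]
# Alternating values: every neighbor differs
checkerboard = [float((r + c) % 2) for r in range(4) for c in range(4)]

i_clustered = morans_i(clustered, grid)
i_checkerboard = morans_i(checkerboard, grid)
print(f"clustered:    {i_clustered:.3f}")
print(f"checkerboard: {i_checkerboard:.3f}")
```

The clustered grid yields a clearly positive value and the checkerboard grid yields a negative one, which mirrors the two interpretations listed above.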

Get the spatial statistics.

Some spatial statistics become available when you run the OLS model with spatial diagnostics. To get spatial diagnostics, you must pass the spatial weights definition when creating the OLSRegressor instance.

from oraclesai.regression import OLSRegressor

ols_model = OLSRegressor(weights_definition).fit(X_train, "MEDIAN_INCOME")

Obtain the Moran’s I statistic from the model’s residuals using the moran_res metric from oraclesai.metrics.

from oraclesai.metrics import moran_res 

morans_i, _, p_value = moran_res(ols_model) 

print(f"Moran's I = {morans_i}") 
print(f"p_value = {p_value}")

The positive and significant value of Moran’s I statistic of the residuals indicates the presence of spatial dependence in the residuals, which means that the prediction error of an observation is similar to the prediction error of its neighbors.

Moran's I = 0.2594180201084295
p_value = 9.432690077796932e-203

Two regression models, the Spatial Lag Model and the Spatial Error Model, include the effect of spatial dependence in their regression equations.

Use the Lagrange Multiplier tests from the spatial diagnostics of the trained OLS model to choose the better model for the data. The Lagrange Multiplier tests for Spatial Lag and Spatial Error are part of oraclesai.metrics.

from oraclesai.metrics import lm_lag, lm_error, rlm_lag, rlm_error
 
print(f"Lagrange Multiplier (lag): {lm_lag(ols_model)}")
print(f"Robust LM (lag): {rlm_lag(ols_model)}")
print(f"Lagrange Multiplier (error): {lm_error(ols_model)}")
print(f"Robust LM (error): {rlm_error(ols_model)}")

Use the robust tests when the Lagrange Multiplier tests are significant for both Spatial Lag and Spatial Error, as they are here. Both robust tests are also significant, but the statistic for spatial error (about 557) is much larger than the one for spatial lag (about 10.7), indicating that the Spatial Error Model is a better fit for the data.

Lagrange Multiplier (lag): (357.8764476978743, 8.165543828650201e-80)
Robust LM (lag): (10.656323334376838, 0.001096952308135397)
Lagrange Multiplier (error): (904.2345462924114, 1.178375337257614e-198)
Robust LM (error): (557.0144219289139, 3.750470342578867e-123)
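The selection logic described above can be sketched as a small decision function. This is one common way to read the four tests, written under stated assumptions and not taken from oraclesai: each result is treated as a (statistic, p-value) tuple as printed above, the 0.05 significance level is an arbitrary choice, and the names choose_spatial_model and chi2_sf_1df are hypothetical. The sketch also assumes that each statistic is compared against a chi-squared distribution with one degree of freedom, which is the conventional reference for these tests and appears consistent with the p-values shown.

```python
import math

def chi2_sf_1df(stat):
    """Upper-tail p-value of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def choose_spatial_model(lm_lag, lm_error, rlm_lag, rlm_error, alpha=0.05):
    """Pick 'ols', 'lag', or 'error' from (statistic, p_value) tuples."""
    lag_sig, err_sig = lm_lag[1] < alpha, lm_error[1] < alpha
    if not lag_sig and not err_sig:
        return "ols"  # no evidence of spatial dependence
    if lag_sig != err_sig:
        return "lag" if lag_sig else "error"
    # Both plain tests are significant: consult the robust versions
    rlag_sig, rerr_sig = rlm_lag[1] < alpha, rlm_error[1] < alpha
    if rlag_sig != rerr_sig:
        return "lag" if rlag_sig else "error"
    # Robust tests agree: prefer the model with the larger statistic
    return "lag" if rlm_lag[0] > rlm_error[0] else "error"

# Values copied from the output above
choice = choose_spatial_model(
    lm_lag=(357.8764476978743, 8.165543828650201e-80),
    lm_error=(904.2345462924114, 1.178375337257614e-198),
    rlm_lag=(10.656323334376838, 0.001096952308135397),
    rlm_error=(557.0144219289139, 3.750470342578867e-123),
)
print(choice)

# Recompute the robust lag p-value from its statistic as a consistency check
print(chi2_sf_1df(10.656323334376838))
```

With the values from this example, the function falls through to the final comparison of the robust statistics and selects the spatial error model.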