clustering

Format

clustering(table, method, scale=True, key_column=None, columns=None, weights=None, weights_def=None,
   save_weights_as=None, spatial_col=None, crs=None, to_crs=None, plot=None, **kwargs)

Parameters

The parameters for this pre-defined function are described in the following table.

Parameter Description
method A string specifying the clustering algorithm to execute. The options are: DBSCAN, KMEANS, and AGGLOMERATIVE.
scale A Boolean value, True by default. If True, the features are scaled before clustering.
key_column If defined, the specified column is added to the resulting pandas DataFrame. Otherwise, a column with the index of the DataFrame is attached to the result.
columns An array of strings indicating the features that form the training set.
weights Required to use spatial weights already stored in the data store. Internally it calls the function oml.ds.load. The supported parameters are ds_name and obj_name, indicating the data store name and object name, respectively.
weights_def Required if the parameter weights is not specified. Establishes the relationship between neighboring locations.

This is passed as a JSON object specifying the type of the weights definition and its parameters. Each parameter is defined in detail in the API Reference documentation.

The following lists the supported types and parameters:

  • KNN: [k]
  • Kernel: [bandwidth, fixed, k, function]
  • DistanceBand: [threshold, p, alpha, binary]
  • Queen
  • Rook
save_weights_as Only used if weights_def is defined. Specifies how the spatial weights are stored in the data store. The value is a JSON object that sets the parameters of oml.ds.save. The supported parameters are: [ds_name, obj_name, overwrite_ds, append, overwrite_obj, grantable, compression]. Some parameter names differ slightly from those in the oml.ds.save function. The parameter overwrite_obj indicates whether an existing object with the same name is replaced with the current object.
spatial_col Specifies the column containing the geometries. If not specified, the column name is retrieved from the table's metadata.
crs Specifies the Coordinate Reference System. If not specified, it is inferred from the table.
to_crs If specified, the Coordinate Reference System will change to the specified value.
plot A dictionary specifying the properties of the Plot Clusters function. If defined, a plot showing the resulting clusters is included in the response.
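
The JSON arguments described in the table can be assembled programmatically before they are passed to pyqEval. The following is a minimal Python sketch, not part of the product API: the helper name build_clustering_params, the value k=5, and the data store and object names are hypothetical. The weights types and parameter names come from the lists above. The sketch assumes that type-specific parameters sit next to "type" in the same object, which this page does not confirm, and it only checks parameter names, not values.

```python
import json

# Weights types and parameter names as listed in the parameter table.
WEIGHTS_PARAMS = {
    "KNN": {"k"},
    "Kernel": {"bandwidth", "fixed", "k", "function"},
    "DistanceBand": {"threshold", "p", "alpha", "binary"},
    "Queen": set(),
    "Rook": set(),
}

# Keys accepted by save_weights_as, as listed in the parameter table.
SAVE_PARAMS = {"ds_name", "obj_name", "overwrite_ds", "append",
               "overwrite_obj", "grantable", "compression"}


def build_clustering_params(table, method, columns, weights_def,
                            save_weights_as=None, **extra):
    """Hypothetical helper: return the JSON string for the first pyqEval argument."""
    wtype = weights_def.get("type")
    if wtype not in WEIGHTS_PARAMS:
        raise ValueError(f"unsupported weights type: {wtype!r}")
    unknown = set(weights_def) - {"type"} - WEIGHTS_PARAMS[wtype]
    if unknown:
        raise ValueError(f"unexpected parameters for {wtype}: {sorted(unknown)}")
    params = {"oml_connect": True, "table": table, "method": method,
              "columns": columns, "weights_def": weights_def}
    if save_weights_as is not None:
        bad = set(save_weights_as) - SAVE_PARAMS
        if bad:
            raise ValueError(f"unexpected save_weights_as keys: {sorted(bad)}")
        params["save_weights_as"] = save_weights_as
    params.update(extra)  # for example n_clusters or key_column
    return json.dumps(params)


par_lst = build_clustering_params(
    "oml_user.la_block_groups", "AGGLOMERATIVE", ["median_income"],
    weights_def={"type": "KNN", "k": 5},          # k=5 is illustrative only
    save_weights_as={"ds_name": "spatial_ds",     # hypothetical names
                     "obj_name": "la_knn_weights",
                     "overwrite_obj": True},
    n_clusters=6, key_column="geoid",
)
print(par_lst)
```

The resulting string can be used as the first argument of pyqEval. Rejecting unknown keys early is a design choice of this sketch; the function itself may accept or ignore additional keys.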

Example

This example shows how to run agglomerative clustering with regionalization on a given dataset, specifying the number of clusters and the type of spatial weights.

The clustering algorithm is set in the method parameter, while the number of clusters and the spatial weights are defined in the n_clusters and weights_def parameters, respectively. The features considered for clustering are specified in the columns parameter.

select *
    from table( 
        pyqEval(
            '{  
                "oml_connect": true, 
                "table": "oml_user.la_block_groups",
                "columns": ["median_income"],
                "method": "AGGLOMERATIVE",
                "n_clusters": 6,
                "key_column": "geoid",
                "weights_def": {"type": "Queen"}
            }',
            '{ "geoid": "VARCHAR2(50)", "label": "NUMBER" }',
            'clustering'
        )
    );

The result contains the index column specified in the key_column parameter and a label for each row, indicating the cluster to which the row belongs.



You can visualize the clusters by using the select IMAGE clause with the oml_graphics_flag parameter set to true. In the following code, the plot parameter requests a basemap as the background. Also, note that the output format (out_fmt) is set to PNG.

select IMAGE
    from table(
        pyqEval(
            par_lst => '{
                "oml_connect": true,
                "oml_graphics_flag": true,
                "table": "oml_user.la_block_groups",
                "columns": ["median_income"],
                "method": "AGGLOMERATIVE",
                "n_clusters": 6,
                "key_column": "geoid",
                "weights_def": {"type": "Queen"},
                "plot": {"with_basemap": true}
            }',
            out_fmt => 'PNG',
            scr_name => 'clustering'
        )
    );

The result is a map with the observations colored according to the cluster to which they are assigned. There are six clusters, as specified in the n_clusters parameter. Because spatial weights are defined, the agglomerative clustering algorithm performs regionalization. This means that observations assigned to the same cluster share common characteristics and are geographically connected.
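
The requirement that clusters be geographically connected can be checked on a toy example. The following Python sketch is an illustration only and does not use the product API. It assumes a hypothetical 3x3 grid of cells, builds Rook (shared edge) and Queen (shared edge or corner) neighbor sets, and tests whether every cluster of a given labeling forms one connected region under the chosen contiguity. The function names and the labeling are made up for the example.

```python
from collections import deque


def grid_neighbors(rows, cols, kind="Queen"):
    """Neighbor sets for the cells of a rows x cols grid, keyed by (row, col)."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # Rook: shared edges
    if kind == "Queen":
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # Queen: also shared corners
    cells = {(r, c) for r in range(rows) for c in range(cols)}
    return {(r, c): {(r + dr, c + dc) for dr, dc in steps} & cells
            for r, c in cells}


def clusters_connected(labels, neighbors):
    """Return True if every cluster forms a single connected region."""
    for label in set(labels.values()):
        members = {cell for cell, lab in labels.items() if lab == label}
        start = next(iter(members))
        seen, queue = {start}, deque([start])
        while queue:  # breadth-first search restricted to this cluster
            for nb in neighbors[queue.popleft()] & members:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if seen != members:
            return False
    return True


# Hypothetical labeling: cluster 1 holds two cells that touch only at a corner.
labels = {(r, c): 0 for r in range(3) for c in range(3)}
labels[(0, 0)] = 1
labels[(1, 1)] = 1

print(clusters_connected(labels, grid_neighbors(3, 3, "Queen")))  # True
print(clusters_connected(labels, grid_neighbors(3, 3, "Rook")))   # False
```

The same labeling is a valid regionalization under Queen contiguity but not under Rook contiguity, which is why the choice of weights type affects which cluster shapes are possible.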