oraclesai.clustering.DBScanClustering

class DBScanClustering(eps=None, min_samples=2, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None, spatial_weights_definition=None, use_spatial_weights_distances=True)

DBSCAN is a density-based clustering technique capable of finding clusters of different shapes and sizes from a large amount of data. This algorithm doesn’t require the number of clusters as a parameter. The algorithm starts at any point; if at least min_samples points are within a radius of eps, then all the points in the neighborhood are considered part of the same cluster. Then the process is repeated for all the points in the neighborhood. There are three types of points:

  • Core Point

    It has at least min_samples points in its neighborhood, that is, within a radius of eps.

  • Border Point

    It is reachable from a core point, but has fewer than min_samples points in its own neighborhood.

  • Noise Point

    It is neither a core point nor a border point; it is not reachable from any core point.
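The three definitions above can be illustrated with a small, self-contained sketch. It does not use the oraclesai API; the function name classify_points and the sample data are hypothetical, and counting a point as a member of its own neighborhood is an assumption borrowed from the scikit-learn convention.

```python
import math


def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN definitions."""
    n = len(points)
    # Neighborhood of i: indices of all points within eps of point i.
    # Assumption: the point itself counts toward min_samples.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]

    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Within eps of a core point, but not dense enough itself.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


# Hypothetical data: a small square, one point hanging off it, one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
kinds = classify_points(points, eps=1.0, min_samples=3)
print(kinds)
```

With these values, the four corners of the square are core points, (2.0, 1.0) is a border point because it lies within eps of the core point (1.0, 1.0), and (8.0, 8.0) is noise.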

Regionalization is used to provide a spatial context to the DBSCAN algorithm. This way, observations of the same cluster are similar not only in their attributes, but also in their spatial location.

Parameters:
  • eps – float, default=None. The maximum distance between two samples for one to be considered as in the neighborhood of the other. If it is None, the K-Distance method is used to estimate the best value for eps.

  • min_samples – int, default=None. The number of samples in a neighborhood for a point to be considered as a core point. If it is None, it is estimated using the number of features in the data.

  • metric – str or callable, default=‘euclidean’. The metric used to calculate the distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square.

  • metric_params – dict, default=None. Additional keyword arguments for the metric function.

  • algorithm – {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=‘auto’. The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors.

  • leaf_size – int, default=30. Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

  • p – float, default=None. The power of the Minkowski metric to be used to calculate distance between points. If None, then p=2 (equivalent to the Euclidean distance).

  • n_jobs – int, default=None. The number of parallel jobs to run.

  • spatial_weights_definition – SpatialWeightsDefinition, default=None. Spatial relationship specification. Defines the criteria used to identify neighbors, for example, KNNWeightsDefinition, DistanceBandWeightsDefinition, etc.

  • use_spatial_weights_distances – bool, default=True. If True, the spatial weights matrix is used as the distance between neighbors. If False, the distance to all neighbors is set to zero.
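The eps parameter mentions the K-Distance method. The sketch below shows the general idea only: compute each point's distance to its k-th nearest neighbor, sort those distances, and look for the knee of the resulting curve. The function names and the "largest jump" knee rule are assumptions made for illustration; the rule this class applies internally is not documented here.

```python
import math


def k_distances(points, k):
    """Return the sorted distances from every point to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def estimate_eps(points, k):
    """Pick the k-distance just before the largest jump in the sorted curve.

    This is a simplified knee heuristic (an assumption), not the library's rule.
    """
    curve = k_distances(points, k)
    gaps = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
    knee = gaps.index(max(gaps))
    return curve[knee]


# Hypothetical data: five nearby points and one distant outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
curve = k_distances(points, k=2)
eps = estimate_eps(points, k=2)
print(curve, eps)
```

In practice the sorted curve is usually plotted and inspected, since an automatic knee rule can be misled by noisy data.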

Methods

__init__([eps, min_samples, metric, ...])

fit(X[, y, geometries, spatial_weights, crs])

Fits a DBSCAN model with the given data and parameters. If spatial weights are defined, regionalization is applied, so that elements of the same cluster are geographically connected.

fit_predict(X[, y, geometries, ...])

Trains the clustering model and returns the labels assigned to each observation.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

Attributes

METRIC_PRECOMPUTED

NON_NEIGHBOR_DISTANCE

eps_

Maximum distance between two samples for one to be considered a neighbor of the other.

isoperimetric_quotient_

The isoperimetric quotient (IPQ) for the resulting clusters.

labels_

Array indicating the cluster associated with each sample.

min_samples_

The number of samples in a neighborhood for a point to be considered as a core point.

silhouette_score_

The Silhouette score for the resulting clusters.
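For context on isoperimetric_quotient_: the isoperimetric quotient of a shape with area A and perimeter P is commonly defined as 4πA / P², which equals 1 for a circle and decreases for less compact shapes. The sketch below computes it for a single simple polygon using only the standard library. The function name is hypothetical, and how this class derives one value from the geometries of all clusters is not stated here, so this is an illustration of the formula rather than of the library's computation.

```python
import math


def isoperimetric_quotient(vertices):
    """Return 4*pi*A / P**2 for a simple polygon given as (x, y) vertices."""
    n = len(vertices)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1  # shoelace formula term
        perimeter += math.dist((x1, y1), (x2, y2))
    area = abs(twice_area) / 2.0
    return 4.0 * math.pi * area / perimeter**2


# A unit square (A = 1, P = 4) and a thin 10 x 0.1 rectangle for comparison.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
strip = [(0.0, 0.0), (10.0, 0.0), (10.0, 0.1), (0.0, 0.1)]
ipq_square = isoperimetric_quotient(square)
ipq_strip = isoperimetric_quotient(strip)
print(ipq_square, ipq_strip)
```

The square scores π/4, while the elongated strip scores much lower, which is why a higher quotient is read as a more compact cluster shape.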